Abstract

Distributed optimization methods for large-scale machine learning suffer from a
communication bottleneck. It is difficult to reduce this bottleneck while still
efficiently and accurately aggregating partial work from different machines. In
this paper, we present a novel generalization of the recent
communication-efficient primal-dual framework (CoCoA) for distributed
optimization. Our framework, CoCoA+, allows for additive combination of local
updates to the global parameters at each iteration, whereas previous schemes
with convergence guarantees only allow conservative averaging. We give stronger
(primal-dual) convergence rate guarantees for both CoCoA and our new variants,
and generalize the theory for both methods to cover non-smooth convex loss
functions. We provide an extensive experimental comparison that shows the
markedly improved performance of CoCoA+ on several real-world distributed
datasets, especially when scaling up the number of machines.


Proceedings of the 32nd International Conference on Machine Learning, Lille,
France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).


1. Introduction 

With the wide availability of large datasets that exceed the storage capacity
of single machines, distributed optimization methods for machine learning have
become increasingly important. Existing methods require significant
communication between workers, frequently equaling the amount of local
computation (or reading of local data). As a result, distributed machine
learning suffers significantly from a communication bottleneck on real-world
systems, where communication is typically several orders of magnitude slower
than reading data from main memory.

In this work we focus on optimization problems with empirical loss minimization
structure, i.e., objectives that are a sum of the loss functions of each
datapoint. This includes the most commonly used regularized variants of linear
regression and classification methods. For this class of problems, the recently
proposed CoCoA approach (Yang, 2013; Jaggi et al., 2014) develops a
communication-efficient primal-dual scheme that targets the communication
bottleneck, allowing more computation on data-local subproblems native to each
machine before communication. By appropriately choosing the amount of local
computation per round, this framework allows one to control the trade-off
between communication and local computation based on the system hardware at
hand.

However, the performance of CoCoA (as well as related primal SGD-based methods)
is significantly reduced by the need to average updates between all machines.
As the number of machines K grows, the updates get diluted and slowed by 1/K,
e.g., in the case where all machines except one would have already reached the
solutions of their respective partial optimization tasks. On the other hand, if
the updates are instead added, the algorithms can diverge, as we will observe
in the practical experiments below.

To address both described issues, in this paper we develop a novel
generalization of the local CoCoA subproblems assigned to each worker, making
the framework more powerful in the following sense: Without extra computational
cost, the set of locally computed updates from the modified subproblems (one
from each machine) can be combined more efficiently between machines. The
proposed CoCoA+ updates can be aggressively added (hence the "+" suffix), which
yields much faster convergence both in practice and in theory. This difference
is particularly significant as the number of machines K becomes large.

1.1. Contributions 

Strong Scaling. To our knowledge, our framework is the first to exhibit
favorable strong scaling for the class of problems considered, as the number of
machines K increases and the data size is kept fixed. More precisely, while the
convergence rate of CoCoA degrades as K is increased, the stronger theoretical
convergence rate here is, in the worst case, independent of K. Our experiments
in Section 7 confirm the improved speed of convergence. Since the number of
communicated vectors is only one per round and worker, this favorable scaling
might be surprising. Indeed, for existing methods, splitting data among more
machines generally increases communication requirements (Shamir & Srebro,
2014), which can severely affect overall runtime.

Theoretical Analysis of Non-Smooth Losses. While the existing analysis for
CoCoA in (Jaggi et al., 2014) only covered smooth loss functions, here we
extend the class of functions where the rates apply, additionally covering,
e.g., Support Vector Machines and non-smooth regression variants. We provide a
primal-dual convergence rate for both CoCoA and our new method CoCoA+ in the
case of general convex (L-Lipschitz) losses.


Arbitrary Local Solvers. Both CoCoA and CoCoA+ allow the use of arbitrary local
solvers on each machine.

Experimental Results. We provide a thorough experimental comparison with
competing algorithms using several real-world distributed datasets. Our
practical results confirm the strong scaling of CoCoA+ as the number of
machines K grows, while competing methods, including the original CoCoA, slow
down significantly with larger K. We implement all algorithms in Spark, and our
code is publicly available at: github.com/gingsmith/cocoa.

1.2. History and Related Work 

While optimal algorithms for the serial (single machine) case are already well
researched and understood, the literature in the distributed setting is
relatively sparse. In particular, details on optimal trade-offs between
computation and communication, as well as optimization or statistical accuracy,
are still largely unclear. For an overview of this currently active research
field, we refer the reader to (Balcan et al., 2012; Richtarik & Takac, 2013;
Duchi et al., 2013; Yang, 2013; Liu & Wright, 2014; Fercoq et al., 2014; Jaggi
et al., 2014; Shamir & Srebro, 2014; Shamir et al., 2014; Zhang & Lin, 2015;
Qu & Richtarik, 2014) and the references therein. We provide a detailed
comparison of our proposed framework to the related work in Section 6.

2. Setup 

We consider regularized empirical loss minimization problems of the following
well-established form:

$$\min_{w \in \mathbb{R}^d} \left\{ \mathcal{P}(w) := \frac{1}{n} \sum_{i=1}^{n} \ell_i(x_i^T w) + \frac{\lambda}{2} \|w\|^2 \right\} \tag{1}$$

Here the vectors $\{x_i\}_{i=1}^n \subset \mathbb{R}^d$ represent the training
data examples, and the $\ell_i(\cdot)$ are arbitrary convex real-valued loss
functions (e.g., hinge loss), possibly depending on label information for the
i-th datapoint. The constant $\lambda > 0$ is the regularization parameter.

The above class includes many standard problems of wide interest in machine
learning, statistics, and signal processing, including support vector machines,
regularized linear and logistic regression, ordinal regression, and others.


Primal-Dual Convergence Rate. Furthermore, we additionally strengthen the rates
by showing stronger primal-dual convergence for both algorithmic frameworks,
which are almost tight to their objective-only counterparts. Primal-dual rates
for CoCoA had not previously been analyzed in the general convex case. Our
primal-dual rates allow efficient and practical certificates for the
optimization quality, e.g., for stopping criteria. The new rates apply to both
smooth and non-smooth losses, and to both CoCoA and the extended CoCoA+.


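Dual Problem, and Primal-Dual Certificates. The conjugate dual of (1) takes the
following form:

$$\max_{\alpha \in \mathbb{R}^n} \left\{ \mathcal{D}(\alpha) := -\frac{1}{n} \sum_{i=1}^{n} \ell_i^*(-\alpha_i) - \frac{\lambda}{2} \left\| \frac{A\alpha}{\lambda n} \right\|^2 \right\} \tag{2}$$

Here the data matrix $A = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$
collects all data-points as its columns, and $\ell_i^*$ is the conjugate
function to $\ell_i$. See, e.g., (Shalev-Shwartz & Zhang, 2013c) for several
concrete applications.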
It is possible to assign for any dual vector $\alpha \in \mathbb{R}^n$ a
corresponding primal feasible point

$$w(\alpha) := \frac{1}{\lambda n} A \alpha \tag{3}$$

The duality gap function is then given by:

$$G(\alpha) := \mathcal{P}(w(\alpha)) - \mathcal{D}(\alpha) \tag{4}$$

By weak duality, every value $\mathcal{D}(\alpha)$ at a dual candidate $\alpha$
provides a lower bound on every primal value $\mathcal{P}(w)$. The duality gap
is therefore a certificate on the approximation quality: The distance to the
unknown true optimum $\mathcal{P}(w^*)$ must always lie within the duality gap,
i.e., $G(\alpha) = \mathcal{P}(w) - \mathcal{D}(\alpha) \ge \mathcal{P}(w) - \mathcal{P}(w^*) \ge 0$.

In large-scale machine learning settings like those considered here, the
availability of such a computable measure of approximation quality is a
significant benefit during training time. Practitioners using classical
primal-only methods such as SGD have no means by which to accurately detect
if a model has been well trained, as $\mathcal{P}(w^*)$ is unknown.
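As a minimal sketch of how such a certificate could be evaluated, the following Python snippet computes $w(\alpha)$ from (3), the objectives (1) and (2), and the gap (4). It assumes the hinge loss $\ell_i(a) = \max(0, 1 - y_i a)$ with labels $y_i \in \{-1, +1\}$, whose conjugate gives $\ell_i^*(-\alpha_i) = -y_i \alpha_i$ whenever $y_i \alpha_i \in [0, 1]$ (and $+\infty$ otherwise). The function names and the toy data are hypothetical and chosen only for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def w_of_alpha(X, alpha, lam):
    """Primal point w(alpha) = A alpha / (lam * n), eq. (3); X[i] is datapoint x_i."""
    n, d = len(X), len(X[0])
    return [sum(alpha[i] * X[i][j] for i in range(n)) / (lam * n) for j in range(d)]

def primal_hinge(X, y, w, lam):
    """P(w) from eq. (1), specialized to the hinge loss (an assumption of this sketch)."""
    n = len(X)
    loss = sum(max(0.0, 1.0 - yi * dot(xi, w)) for xi, yi in zip(X, y)) / n
    return loss + 0.5 * lam * dot(w, w)

def dual_hinge(X, y, alpha, lam):
    """D(alpha) from eq. (2) for the hinge loss; -inf if alpha is dual infeasible."""
    if any(yi * ai < 0.0 or yi * ai > 1.0 for yi, ai in zip(y, alpha)):
        return float("-inf")
    w = w_of_alpha(X, alpha, lam)
    return sum(yi * ai for yi, ai in zip(y, alpha)) / len(X) - 0.5 * lam * dot(w, w)

def duality_gap(X, y, alpha, lam):
    """G(alpha) = P(w(alpha)) - D(alpha), eq. (4)."""
    return primal_hinge(X, y, w_of_alpha(X, alpha, lam), lam) - dual_hinge(X, y, alpha, lam)

# Toy data (hypothetical): three points in the plane.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
y = [1.0, 1.0, -1.0]
lam = 0.5

gap_at_zero = duality_gap(X, y, [0.0, 0.0, 0.0], lam)   # alpha = 0 gives P = 1, D = 0
gap_at_y = duality_gap(X, y, list(y), lam)              # alpha_i = y_i is dual feasible
```

A stopping rule could then compare `duality_gap(...)` against a target accuracy, since by weak duality the gap bounds the primal suboptimality from above.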

Classes of Loss-Functions. To simplify presentation, we assume that all loss
functions $\ell_i$ are non-negative, and

$$\ell_i(0) \le 1 \quad \forall i \tag{5}$$

Definition 1 (L-Lipschitz continuous loss). A function $\ell_i : \mathbb{R} \to \mathbb{R}$
is L-Lipschitz continuous if $\forall a, b \in \mathbb{R}$, we have

$$|\ell_i(a) - \ell_i(b)| \le L |a - b| \tag{6}$$

Definition 2 ($(1/\mu)$-smooth loss). A function $\ell_i : \mathbb{R} \to \mathbb{R}$
is $(1/\mu)$-smooth if it is differentiable and its derivative is
$(1/\mu)$-Lipschitz continuous, i.e., $\forall a, b \in \mathbb{R}$, we have

$$|\ell_i'(a) - \ell_i'(b)| \le \frac{1}{\mu} |a - b| \tag{7}$$

3. The CoCoA+ Algorithm Framework

In this section we present our novel CoCoA+ framework. CoCoA+ inherits the many
benefits of CoCoA as it remains a highly flexible and scalable,
communication-efficient framework for distributed optimization. CoCoA+ differs
algorithmically in that we modify the form of the local subproblems (9) to
allow for more aggressive additive updates (as controlled by $\gamma$). We will
see that these changes allow for stronger convergence guarantees as well as
improved empirical performance. Proofs of all statements in this section are
given in the supplementary material.

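Data Partitioning. We write $\{\mathcal{P}_k\}_{k=1}^K$ for the given partition
of the datapoints $[n] := \{1, 2, \dots, n\}$ over the $K$ worker machines. We
denote the size of each part by $n_k = |\mathcal{P}_k|$. For any $k \in [K]$ and
$\alpha \in \mathbb{R}^n$ we use the notation $\alpha_{[k]} \in \mathbb{R}^n$
for the vector

$$(\alpha_{[k]})_i := \begin{cases} 0, & \text{if } i \notin \mathcal{P}_k, \\ \alpha_i, & \text{otherwise.} \end{cases}$$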
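To make the block notation concrete, here is a small illustrative sketch of the restriction $\alpha_{[k]}$. The helper name `restrict` and the example values are hypothetical, and indices are 0-based here whereas the text uses $[n] = \{1, \dots, n\}$.

```python
def restrict(alpha, part):
    """Return alpha_[k]: a length-n copy of alpha, zeroed outside the index set part."""
    members = set(part)
    return [a if i in members else 0.0 for i, a in enumerate(alpha)]

alpha = [0.5, -1.0, 2.0, 0.25]
parts = [[0, 2], [1, 3]]          # a partition of {0, 1, 2, 3} over K = 2 machines

blocks = [restrict(alpha, part) for part in parts]
# blocks[0] keeps coordinates 0 and 2, blocks[1] keeps coordinates 1 and 3.

# Because the parts are disjoint and cover all indices, the blocks sum back to alpha.
total = [sum(block[i] for block in blocks) for i in range(len(alpha))]
```

Since every $\alpha_{[k]}$ stays in $\mathbb{R}^n$, updates from different machines can be combined by plain vector addition.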


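Local Subproblems in CoCoA+. We can define a data-local subproblem of the
original dual optimization problem (2), which can be solved on machine $k$ and
only requires accessing data which is already available locally, i.e.,
datapoints with $i \in \mathcal{P}_k$. More formally, each machine $k$ is
assigned the following local subproblem, depending only on the previous shared
primal vector $w \in \mathbb{R}^d$, and the change in the local dual variables
$\alpha_i$ with $i \in \mathcal{P}_k$:

$$\max_{\Delta\alpha_{[k]} \in \mathbb{R}^n} \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]}) \tag{8}$$

where

$$\mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]}) := -\frac{1}{n} \sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) - \frac{1}{K} \frac{\lambda}{2} \|w\|^2 - \frac{1}{n} w^T A \Delta\alpha_{[k]} - \frac{\lambda}{2} \sigma' \left\| \frac{1}{\lambda n} A \Delta\alpha_{[k]} \right\|^2 \tag{9}$$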


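Interpretation. The above definition of the local objective functions
$\mathcal{G}_k^{\sigma'}$ is such that they closely approximate the global dual
objective $\mathcal{D}$, as we vary the 'local' variable $\Delta\alpha_{[k]}$,
in the following precise sense:

Lemma 3. For any dual $\alpha, \Delta\alpha \in \mathbb{R}^n$, primal
$w = w(\alpha)$ and real values $\gamma, \sigma'$ satisfying (11), it holds that

$$\mathcal{D}\Big(\alpha + \gamma \sum_{k=1}^{K} \Delta\alpha_{[k]}\Big) \ge (1 - \gamma)\, \mathcal{D}(\alpha) + \gamma \sum_{k=1}^{K} \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]}) \tag{10}$$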

The role of the parameter $\sigma'$ is to measure the difficulty of the given
data partition. For our purposes, we will see that it must be chosen not
smaller than

$$\sigma' \ge \sigma'_{\min} := \gamma \max_{\alpha \in \mathbb{R}^n} \frac{\|A\alpha\|^2}{\sum_{k=1}^{K} \|A\alpha_{[k]}\|^2} \tag{11}$$

In the following lemma, we show that this parameter can be upper-bounded by
$\gamma K$, which is trivial to calculate for all values $\gamma \in \mathbb{R}$.
We show experimentally (Section 7) that this safe upper bound for $\sigma'$ has
a minimal effect on the overall performance of the algorithm. Our main theorems
below show convergence rates dependent on $\gamma \in [\frac{1}{K}, 1]$, which
we refer to as the aggregation parameter.

Lemma 4. The choice of $\sigma' := \gamma K$ is valid for (11), i.e.,

$$\gamma K \ge \sigma'_{\min}.$$
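The bound of Lemma 4 can be sanity-checked numerically. The sketch below evaluates the ratio $\|A\alpha\|^2 / \sum_k \|A\alpha_{[k]}\|^2$ from (11) for random $\alpha$ and confirms that it never exceeds $K$; it also shows a case (identical columns, constant $\alpha$) where the ratio equals $K$ exactly. The random data, the helper names, and the sizes are assumptions for illustration only, and random sampling only probes the maximum in (11) from below.

```python
import random

def mat_vec(cols, a):
    """Return A a, where cols[i] is the i-th column x_i of A."""
    d = len(cols[0])
    return [sum(a[i] * cols[i][j] for i in range(len(cols))) for j in range(d)]

def sq_norm(v):
    return sum(x * x for x in v)

def partition_ratio(cols, alpha, parts):
    """||A alpha||^2 / sum_k ||A alpha_[k]||^2, the quantity maximized in (11)."""
    num = sq_norm(mat_vec(cols, alpha))
    den = 0.0
    for part in parts:
        members = set(part)
        alpha_k = [a if i in members else 0.0 for i, a in enumerate(alpha)]
        den += sq_norm(mat_vec(cols, alpha_k))
    return num / den

rng = random.Random(0)
n, d, K = 12, 3, 4
cols = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
parts = [list(range(k, n, K)) for k in range(K)]

# Largest ratio seen over 200 random dual vectors.
worst = max(
    partition_ratio(cols, [rng.gauss(0.0, 1.0) for _ in range(n)], parts)
    for _ in range(200)
)

# Tight case: identical columns and a constant alpha give a ratio of exactly K.
same_cols = [[1.0, 2.0] for _ in range(8)]
tight = partition_ratio(same_cols, [1.0] * 8, [[0, 1], [2, 3], [4, 5], [6, 7]])
```

On typical data the observed ratio sits well below $K$, which is consistent with the remark that values of $\sigma'$ smaller than the safe bound $\gamma K$ can still be admissible for a given partition.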


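Notion of Approximation Quality of the Local Solver.

Assumption 1 ($\Theta$-approximate solution). We assume that there exists
$\Theta \in [0, 1)$ such that $\forall k \in [K]$, the local solver at any outer
iteration $t$ produces a (possibly) randomized approximate solution
$\Delta\alpha_{[k]}$, which satisfies

$$E\big[\mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]})\big] \le \Theta \Big( \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(0; w, \alpha_{[k]}) \Big), \tag{12}$$

where

$$\Delta\alpha^*_{[k]} \in \arg\max_{\Delta\alpha \in \mathbb{R}^n} \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]}) \quad \forall k \in [K] \tag{13}$$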

We are now ready to describe the CoCoA+ framework, shown in Algorithm 1. The
crucial difference compared to the existing CoCoA algorithm (Jaggi et al.,
2014) is the more general local subproblem, as defined in (9), as well as the
aggregation parameter $\gamma$. These modifications allow the option of
directly adding updates to the global vector $w$.


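Algorithm 1 CoCoA+ Framework
1: Input: Datapoints $A$ distributed according to partition
   $\{\mathcal{P}_k\}_{k=1}^K$. Aggregation parameter $\gamma \in (0, 1]$,
   subproblem parameter $\sigma'$ for the local subproblems
   $\mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]})$ for each $k \in [K]$.
   Starting point $\alpha^{(0)} := 0 \in \mathbb{R}^n$, $w^{(0)} := 0 \in \mathbb{R}^d$.
2: for $t = 0, 1, 2, \dots$ do
3:   for $k \in \{1, 2, \dots, K\}$ in parallel over computers do
4:     call the local solver, computing a $\Theta$-approximate solution
       $\Delta\alpha_{[k]}$ of the local subproblem (9)
5:     update $\alpha_{[k]}^{(t+1)} := \alpha_{[k]}^{(t)} + \gamma \Delta\alpha_{[k]}$
6:     return $\Delta w_k := \frac{1}{\lambda n} A \Delta\alpha_{[k]}$
7:   end for
8:   reduce $w^{(t+1)} := w^{(t)} + \gamma \sum_{k=1}^{K} \Delta w_k$   (14)
9: end for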
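The following single-process Python sketch mirrors the structure of Algorithm 1 under several stated assumptions: the loss is the hinge loss with labels in $\{-1, +1\}$, the local solver is a coordinate-ascent routine with a closed-form step for that loss, and the $K$ "machines" are simulated by a sequential loop. All function names and the toy data are hypothetical. It is meant as an illustration of the update and aggregation pattern, not as the reference implementation.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def primal(X, y, w, lam):
    """P(w) of eq. (1) for the hinge loss."""
    n = len(X)
    return sum(max(0.0, 1.0 - yi * dot(xi, w)) for xi, yi in zip(X, y)) / n + 0.5 * lam * dot(w, w)

def dual(y, alpha, w, lam):
    """D(alpha) of eq. (2) for the hinge loss, given w = w(alpha) and feasible alpha."""
    return sum(a * yi for a, yi in zip(alpha, y)) / len(y) - 0.5 * lam * dot(w, w)

def local_solver(X, y, part, alpha, w, lam, sigma_p, H, rng):
    """H coordinate-ascent steps on subproblem (9) restricted to the indices in part."""
    n, d = len(X), len(w)
    d_alpha = {i: 0.0 for i in part}
    w_loc = list(w)                      # tracks w + sigma' * A d_alpha / (lam * n)
    for _ in range(H):
        i = rng.choice(part)
        q = dot(X[i], X[i])
        if q == 0.0:
            continue
        beta = y[i] * (alpha[i] + d_alpha[i])            # current value, lies in [0, 1]
        grad = 1.0 - y[i] * dot(X[i], w_loc)
        beta_new = min(1.0, max(0.0, beta + lam * n * grad / (sigma_p * q)))
        step = y[i] * (beta_new - beta)                  # exact maximizer along e_i
        d_alpha[i] += step
        for j in range(d):
            w_loc[j] += sigma_p * step * X[i][j] / (lam * n)
    d_w = [(w_loc[j] - w[j]) / sigma_p for j in range(d)]    # = A d_alpha / (lam * n)
    return d_alpha, d_w

def cocoa_plus(X, y, parts, lam, gamma, sigma_p, H, rounds, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha, w = [0.0] * n, [0.0] * d
    for _ in range(rounds):
        # Every worker starts from the same shared w (steps 3-7 of Algorithm 1).
        updates = [local_solver(X, y, part, alpha, w, lam, sigma_p, H, rng) for part in parts]
        for d_alpha, d_w in updates:                     # aggregation, step 5 and eq. (14)
            for i, da in d_alpha.items():
                alpha[i] += gamma * da
            for j in range(d):
                w[j] += gamma * d_w[j]
    return alpha, w

# Toy data (hypothetical): 40 points in the plane, split over K = 4 simulated machines.
n, K, lam = 40, 4, 0.1
y = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
X = [[yi * (1.0 + 0.1 * ((7 * i) % 5 - 2)), 0.05 * ((3 * i) % 7 - 3)] for i, yi in enumerate(y)]
parts = [list(range(k, n, K)) for k in range(K)]

# "Adding": gamma = 1 together with the safe choice sigma' = gamma * K.
alpha, w = cocoa_plus(X, y, parts, lam, gamma=1.0, sigma_p=float(K), H=20, rounds=30)
gap = primal(X, y, w, lam) - dual(y, alpha, w, lam)
```

Setting `gamma=1.0 / K` and `sigma_p=1.0` in the call would correspond to the averaging variant. In a real deployment the list comprehension over `parts` would be replaced by a parallel map over workers, with only the vectors `d_w` communicated.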


4. Convergence Guarantees 

Before being able to state our main convergence results, we introduce some
useful quantities and the following main lemma characterizing the effect of
iterations of Algorithm 1, for any chosen internal local solver.

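Lemma 5. Let $\ell_i^*$ be strongly¹ convex with convexity parameter
$\mu \ge 0$ with respect to the norm $\|\cdot\|$, $\forall i \in [n]$. Then for
all iterations $t$ of Algorithm 1 under Assumption 1, and any $s \in [0, 1]$, it
holds that

$$E\big[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})\big] \ge \gamma (1 - \Theta) \Big( s\, G(\alpha^{(t)}) - \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 R^{(t)} \Big), \tag{15}$$

where

$$R^{(t)} := -\frac{\lambda \mu n (1 - s)}{\sigma' s} \|u^{(t)} - \alpha^{(t)}\|^2 + \sum_{k=1}^{K} \|A (u^{(t)} - \alpha^{(t)})_{[k]}\|^2, \tag{16}$$

for $u^{(t)} \in \mathbb{R}^n$ with

$$-u_i^{(t)} \in \partial \ell_i\big(w(\alpha^{(t)})^T x_i\big). \tag{17}$$

¹Note that the case of weakly convex $\ell_i^*(\cdot)$ is explicitly allowed
here as well, as the Lemma holds for the case $\mu = 0$.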


The following Lemma provides a uniform bound on $R^{(t)}$:

Lemma 6. If $\ell_i$ are L-Lipschitz continuous for all $i \in [n]$, then

$$\forall t: \; R^{(t)} \le 4L^2 \underbrace{\sum_{k=1}^{K} \sigma_k n_k}_{=: \sigma}, \tag{18}$$

where

$$\sigma_k := \max_{\alpha_{[k]} \in \mathbb{R}^n} \frac{\|A\alpha_{[k]}\|^2}{\|\alpha_{[k]}\|^2}. \tag{19}$$

Remark 7. If all data-points $x_i$ are normalized such that $\|x_i\| \le 1$
$\forall i \in [n]$, then $\sigma_k \le |\mathcal{P}_k| = n_k$. Furthermore, if
we assume that the data partition is balanced, i.e., that $n_k = n/K$ for all
$k$, then $\sigma \le n^2/K$. This can be used to bound the constants $R^{(t)}$
above, as $R^{(t)} \le \frac{4L^2 n^2}{K}$.
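As an illustrative sketch, $\sigma_k$ in (19) is the largest eigenvalue of $A_k^T A_k$, where $A_k$ holds the columns of $A$ indexed by $\mathcal{P}_k$, so it can be estimated with power iteration. The code below does this on synthetic unit-norm data and compares $\sigma = \sum_k \sigma_k n_k$ with the bound $n^2/K$ of Remark 7. The data, the names, and the iteration count are assumptions; power iteration approaches $\sigma_k$ from below, so the computed values are estimates rather than certified maxima.

```python
import math
import random

def sigma_k(cols, part, iters=300, seed=0):
    """Power-iteration estimate of max ||A a||^2 / ||a||^2 over a supported on part."""
    rng = random.Random(seed)
    block = [cols[i] for i in part]
    m, d = len(block), len(block[0])
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in a))
    a = [x / norm for x in a]
    est = 0.0
    for _ in range(iters):
        v = [sum(a[i] * block[i][j] for i in range(m)) for j in range(d)]   # v = A_k a
        est = sum(x * x for x in v)          # Rayleigh quotient, since ||a|| = 1
        a = [dot_cv for dot_cv in (sum(c[j] * v[j] for j in range(d)) for c in block)]
        norm = math.sqrt(sum(x * x for x in a))
        if norm == 0.0:
            return 0.0
        a = [x / norm for x in a]            # a <- A_k^T A_k a, renormalized
    return est

rng = random.Random(1)
n, d, K = 32, 5, 4
cols = []
for _ in range(n):
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    nx = math.sqrt(sum(c * c for c in x))
    cols.append([c / nx for c in x])         # normalize so that ||x_i|| = 1
parts = [list(range(k, n, K)) for k in range(K)]

sigma_ks = [sigma_k(cols, part) for part in parts]
sigma = sum(s * len(part) for s, part in zip(sigma_ks, parts))
ratio = (n * n / K) / sigma                  # how loose the bound n^2 / K is here
```

The quantity `ratio` is the same kind of number that Table 1 reports for real datasets, computed here on hypothetical data.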


4.1. Primal-Dual Convergence for General Convex Losses

The following theorem shows the convergence for non-smooth loss functions, in
terms of objective values as well as primal-dual gap. The analysis in (Jaggi et
al., 2014) only covered the case of smooth loss functions.


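Theorem 8. Consider Algorithm 1 with Assumption 1. Let $\ell_i(\cdot)$ be
L-Lipschitz continuous, and $\epsilon_G > 0$ be the desired duality gap (and
hence an upper-bound on primal sub-optimality). Then after $T$ iterations, where

$$T \ge T_0 + \max\Big\{ \Big\lceil \frac{1}{\gamma(1-\Theta)} \Big\rceil, \frac{4L^2 \sigma \sigma'}{\lambda n^2 \epsilon_G \gamma (1-\Theta)} \Big\}, \tag{20}$$

$$T_0 \ge t_0 + \Big( \frac{2}{\gamma(1-\Theta)} \Big( \frac{8L^2 \sigma \sigma'}{\lambda n^2 \epsilon_G} - 1 \Big) \Big)_+,$$

$$t_0 \ge \max\Big( 0, \Big\lceil \frac{1}{\gamma(1-\Theta)} \log\Big( \frac{2\lambda n^2 (\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}))}{4L^2 \sigma \sigma'} \Big) \Big\rceil \Big),$$

we have that the expected duality gap satisfies

$$E[\mathcal{P}(w(\bar\alpha)) - \mathcal{D}(\bar\alpha)] \le \epsilon_G,$$

at the averaged iterate

$$\bar\alpha := \frac{1}{T - T_0} \sum_{t=T_0+1}^{T-1} \alpha^{(t)}. \tag{21}$$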
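To see how the three conditions of Theorem 8 combine, the sketch below evaluates the smallest $t_0$, $T_0$ and $T$ they allow. It is only a calculator for the worst-case bound: the function name and the example constants are hypothetical, and the unknown $\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)})$ is replaced by its upper bound 1 from Lemma 17, which can only enlarge $t_0$.

```python
import math

def theorem8_rounds(L, lam, n, sigma, sigma_p, gamma, theta, eps_gap, dual_subopt0=1.0):
    """Smallest T satisfying (20) with the accompanying conditions on T_0 and t_0."""
    c = 1.0 / (gamma * (1.0 - theta))                    # 1 / (gamma (1 - Theta))
    ratio = L * L * sigma * sigma_p / (lam * n * n)      # L^2 sigma sigma' / (lam n^2)
    t0 = max(0, math.ceil(c * math.log(2.0 * dual_subopt0 / (4.0 * ratio))))
    T0 = t0 + max(0.0, 2.0 * c * (8.0 * ratio / eps_gap - 1.0))
    T = T0 + max(math.ceil(c), 4.0 * ratio * c / eps_gap)
    return math.ceil(T)

# Adding (gamma = 1, sigma' = K) on balanced, normalized data, so sigma <= n^2 / K.
n = 2 ** 16
adding = {
    K: theorem8_rounds(L=1.0, lam=1e-3, n=n, sigma=n * n / K, sigma_p=float(K),
                       gamma=1.0, theta=0.5, eps_gap=1e-2)
    for K in (4, 16, 64)
}
tighter = theorem8_rounds(L=1.0, lam=1e-3, n=n, sigma=n * n / 4, sigma_p=4.0,
                          gamma=1.0, theta=0.5, eps_gap=1e-3)
```

With $\gamma = 1$, $\sigma' = K$ and $\sigma = n^2/K$, the product $\sigma\sigma'$ equals $n^2$ for every $K$, which is why the values in `adding` coincide.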


The following corollary of the above theorem clarifies our main result: The
more aggressive adding of the partial updates, as compared to averaging, offers
a very significant improvement in terms of total iterations needed. While the
convergence in the 'adding' case becomes independent of the number of machines
K, the 'averaging' regime shows the known degradation of the rate with growing
K, which is a major drawback of the original CoCoA algorithm. This important
difference in the convergence speed is not a theoretical artifact but is also
confirmed in our practical experiments below for different K, as shown e.g. in
Figure 2.

We further demonstrate below that by choosing $\gamma$ and $\sigma'$
accordingly, we can still recover the original CoCoA algorithm and its rate.
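Corollary 9. Assume that all datapoints $x_i$ are bounded as $\|x_i\| \le 1$
and that the data partition is balanced, i.e., that $n_k = n/K$ for all $k$. We
consider two different possible choices of the aggregation parameter $\gamma$:

• (CoCoA Averaging, $\gamma := \frac{1}{K}$): In this case, $\sigma' := 1$ is a
valid choice which satisfies (11). Then using $\sigma \le n^2/K$ in light of
Remark 7, we have that $T$ iterations are sufficient for primal-dual accuracy
$\epsilon_G$, with

$$T \ge T_0 + \max\Big\{ \Big\lceil \frac{K}{1-\Theta} \Big\rceil, \frac{4L^2}{\lambda \epsilon_G (1-\Theta)} \Big\},$$

$$T_0 \ge t_0 + \Big( \frac{2K}{1-\Theta} \Big( \frac{8L^2}{\lambda K \epsilon_G} - 1 \Big) \Big)_+,$$

$$t_0 \ge \max\Big( 0, \Big\lceil \frac{K}{1-\Theta} \log\Big( \frac{2\lambda K (\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}))}{4L^2} \Big) \Big\rceil \Big).$$

Hence the more machines K, the more iterations are needed (in the worst case).

• (CoCoA+ Adding, $\gamma := 1$): In this case, the choice of $\sigma' := K$
satisfies (11). Then using $\sigma \le n^2/K$ in light of Remark 7, we have
that $T$ iterations are sufficient for primal-dual accuracy $\epsilon_G$, with

$$T \ge T_0 + \max\Big\{ \Big\lceil \frac{1}{1-\Theta} \Big\rceil, \frac{4L^2}{\lambda \epsilon_G (1-\Theta)} \Big\},$$

$$T_0 \ge t_0 + \Big( \frac{2}{1-\Theta} \Big( \frac{8L^2}{\lambda \epsilon_G} - 1 \Big) \Big)_+,$$

$$t_0 \ge \max\Big( 0, \Big\lceil \frac{1}{1-\Theta} \log\Big( \frac{2\lambda (\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}))}{4L^2} \Big) \Big\rceil \Big).$$

This is significantly better than the averaging case.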

In practice, we usually have $\sigma \ll n^2/K$, and hence the actual
convergence rate can be much better than the proven worst-case bound. Table 1
shows that the actual value of $\sigma$ is typically between one and two orders
of magnitude smaller than the upper bound $n^2/K$ that we use.


Table 1. The ratio of the upper bound $n^2/K$ divided by the true value of the
parameter $\sigma$, for some real datasets.

K             16       32       64      128      256      512
news      15.483   14.933   14.278   13.390   12.074   10.252
real-sim  42.127   36.898   30.780   23.814   16.965   11.835
rcv1      40.138   23.827   28.204   21.792   16.339   11.099

K            256      512     1024     2048     4096     8192
covtype   17.277   17.260   17.239   16.948   17.238   12.729


4.2. Primal-Dual Convergence for Smooth Losses 

The following theorem shows the convergence for smooth 
losses, in terms of the objective as well as primal-dual gap. 

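Theorem 10. Assume the loss functions $\ell_i$ are $(1/\mu)$-smooth
$\forall i \in [n]$. We define $\sigma_{\max} = \max_{k \in [K]} \sigma_k$.
Then after $T$ iterations of Algorithm 1, with

$$T \ge \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \log \frac{1}{\epsilon_{\mathcal{D}}},$$

it holds that

$$E[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(T)})] \le \epsilon_{\mathcal{D}}.$$

Furthermore, after $T$ iterations with

$$T \ge \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \log\Big( \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \frac{1}{\epsilon_G} \Big),$$

we have the expected duality gap

$$E[\mathcal{P}(w(\alpha^{(T)})) - \mathcal{D}(\alpha^{(T)})] \le \epsilon_G.$$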


The following corollary is analogous to Corollary 9, but for the case of smooth
losses. It again shows that while the CoCoA variant degrades with the increase
of the number of machines K, the CoCoA+ rate is independent of K.

Corollary 11. Assume that all datapoints $x_i$ are bounded as $\|x_i\| \le 1$
and that the data partition is balanced, i.e., that $n_k = n/K$ for all $k$. We
again consider the same two different possible choices of the aggregation
parameter $\gamma$:

• (CoCoA Averaging, $\gamma := \frac{1}{K}$): In this case, $\sigma' := 1$ is a
valid choice which satisfies (11). Then using $\sigma_{\max} \le n_k = n/K$ in
light of Remark 7, we have that $T$ iterations are sufficient for suboptimality
$\epsilon_{\mathcal{D}}$, with

$$T \ge \frac{1}{1-\Theta} \frac{\lambda \mu K + 1}{\lambda \mu} \log \frac{1}{\epsilon_{\mathcal{D}}}.$$

Hence the more machines K, the more iterations are needed (in the worst case).

• (CoCoA+ Adding, $\gamma := 1$): In this case, the choice of $\sigma' := K$
satisfies (11). Then using $\sigma_{\max} \le n_k = n/K$ in light of Remark 7,
we have that $T$ iterations are sufficient for suboptimality
$\epsilon_{\mathcal{D}}$, with

$$T \ge \frac{1}{1-\Theta} \frac{\lambda \mu + 1}{\lambda \mu} \log \frac{1}{\epsilon_{\mathcal{D}}}.$$

This is significantly better than the averaging case.
Both rates hold analogously for the duality gap.
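A short numeric illustration of the two regimes, using the dual-suboptimality bound of Theorem 10 directly: the sketch evaluates the bound for averaging ($\gamma = 1/K$, $\sigma' = 1$) and adding ($\gamma = 1$, $\sigma' = K$) with $\sigma_{\max} = n/K$. The function name and the constants ($\lambda$, $\mu$, $\Theta$, $\epsilon_{\mathcal{D}}$, $n$) are hypothetical values picked only to make the comparison visible.

```python
import math

def theorem10_rounds(lam, mu, n, sigma_max, sigma_p, gamma, theta, eps_dual):
    """Smallest T satisfying the dual-suboptimality condition of Theorem 10."""
    c = 1.0 / (gamma * (1.0 - theta))
    kappa = (lam * mu * n + sigma_max * sigma_p) / (lam * mu * n)
    return math.ceil(c * kappa * math.log(1.0 / eps_dual))

n, lam, mu, theta, eps = 2 ** 16, 1e-2, 1.0, 0.5, 1e-3
averaging, adding = {}, {}
for K in (4, 16, 64):
    sigma_max = n / K                       # Remark 7: sigma_max <= n_k = n / K
    averaging[K] = theorem10_rounds(lam, mu, n, sigma_max, 1.0, 1.0 / K, theta, eps)
    adding[K] = theorem10_rounds(lam, mu, n, sigma_max, float(K), 1.0, theta, eps)
```

In this worst-case bound the averaging count grows with $K$ through the term $\lambda\mu K + 1$, while the adding count depends only on $\lambda$, $\mu$, $\Theta$ and $\epsilon_{\mathcal{D}}$.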


4.3. Comparison with Original CoCoA

Remark 12. If we choose averaging ($\gamma := \frac{1}{K}$) for aggregating the
updates, together with $\sigma' := 1$, then the resulting Algorithm 1 is
identical to CoCoA analyzed in (Jaggi et al., 2014). However, they only provide
convergence for smooth loss functions $\ell_i$ and have guarantees for dual
suboptimality and not the duality gap. Formally, when $\sigma' = 1$, the
subproblems (9) will differ from the original dual $\mathcal{D}(\cdot)$ only by
an additive constant, which does not affect the local optimization algorithms
used within CoCoA.


5. SDCA as an Example Local Solver 

We have shown convergence rates for Algorithm 1, depending solely on the
approximation quality $\Theta$ of the used local solver (Assumption 1). Any
chosen local solver in each round receives the local $\alpha$ variables as an
input, as well as a shared vector $w \overset{(3)}{=} w(\alpha)$ being
compatible with the last state of all global $\alpha \in \mathbb{R}^n$
variables.

As an illustrative example of a local solver, Algorithm 2 below summarizes
randomized coordinate ascent (SDCA) applied to the local subproblem (9). The
following two theorems (13, 14) characterize the local convergence for both
smooth and non-smooth functions. In all the results we will use
$r_{\max} := \max_{i \in [n]} \|x_i\|^2$.


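Algorithm 2 LocalSDCA ($w$, $\alpha_{[k]}$, $k$, $H$)
1: Input: $\alpha_{[k]}$, $w = w(\alpha)$
2: Data: Local $\{(x_i, y_i)\}_{i \in \mathcal{P}_k}$
3: Initialize: $\Delta\alpha_{[k]}^{(0)} := 0 \in \mathbb{R}^n$
4: for $h = 0, 1, \dots, H - 1$ do
5:   choose $i \in \mathcal{P}_k$ uniformly at random
6:   $\delta_i^* := \arg\max_{\delta_i \in \mathbb{R}} \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}^{(h)} + \delta_i e_i; w, \alpha_{[k]})$
7:   $\Delta\alpha_{[k]}^{(h+1)} := \Delta\alpha_{[k]}^{(h)} + \delta_i^* e_i$
8: end for
9: Output: $\Delta\alpha_{[k]}^{(H)}$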
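As a worked example of step 6, consider the hinge loss $\ell_i(a) = \max(0, 1 - y_i a)$ with $y_i \in \{-1, +1\}$; this choice of loss is an assumption made here for illustration. Its conjugate satisfies $\ell_i^*(-\alpha_i) = -y_i \alpha_i$ for $y_i \alpha_i \in [0, 1]$ and is $+\infty$ otherwise. Writing $v := \frac{1}{\lambda n} A \Delta\alpha_{[k]}^{(h)}$ and $\beta := y_i\big(\alpha_i + (\Delta\alpha_{[k]}^{(h)})_i\big)$, the one-dimensional problem in (9) is a concave quadratic in $\delta_i$ with a box constraint, so its maximizer is obtained by clipping the unconstrained stationary point:

```latex
% Terms of (9) that depend on \delta_i, for the hinge loss:
\frac{1}{n}\, y_i \delta_i \;-\; \frac{1}{n}\, w^T x_i\, \delta_i
  \;-\; \frac{\lambda \sigma'}{2} \Big\| v + \frac{\delta_i}{\lambda n} x_i \Big\|^2 ,
\qquad \text{subject to } y_i\big(\alpha_i + (\Delta\alpha_{[k]}^{(h)})_i + \delta_i\big) \in [0, 1].

% Setting the derivative to zero:
\frac{1}{n}\Big( y_i - x_i^T (w + \sigma' v) - \frac{\sigma' \|x_i\|^2}{\lambda n}\, \delta_i \Big) = 0
\;\;\Longrightarrow\;\;
\delta_i = \frac{\lambda n \,\big( y_i - x_i^T (w + \sigma' v) \big)}{\sigma' \|x_i\|^2} .

% Clipping to the feasible interval (using y_i^2 = 1):
\delta_i^* \;=\; y_i \left( \min\!\Big\{ 1,\; \max\!\Big\{ 0,\;
   \beta + \frac{\lambda n \,\big( 1 - y_i\, x_i^T (w + \sigma' v) \big)}{\sigma' \|x_i\|^2} \Big\} \Big\} - \beta \right).
```

The update only needs the vector $w + \sigma' v$, so a local solver can maintain it incrementally and touch a single datapoint $x_i$ per inner iteration.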


Theorem 13. Assume the functions $\ell_i$ are $(1/\mu)$-smooth for $i \in [n]$.
Then Assumption 1 on the local approximation quality $\Theta$ is satisfied for
LocalSDCA as given in Algorithm 2, if we choose the number of inner iterations
$H$ as

$$H \ge n_k \frac{\sigma' r_{\max} + \lambda n \mu}{\lambda n \mu} \log \frac{1}{\Theta}. \tag{22}$$


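Theorem 14. Assume the functions $\ell_i$ are L-Lipschitz for $i \in [n]$. Then
Assumption 1 on the local approximation quality $\Theta$ is satisfied for
LocalSDCA as given in Algorithm 2, if we choose the number of inner iterations
$H$ as

$$H \ge n_k \left( \frac{1 - \Theta}{\Theta} + \frac{\sigma' r_{\max}}{2 \Theta \lambda n^2} \, \frac{\|\Delta\alpha^*_{[k]}\|^2}{\mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; \cdot) - \mathcal{G}_k^{\sigma'}(0; \cdot)} \right). \tag{23}$$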

Remark 15. Between the different regimes allowed in CoCoA+ (ranging between
averaging and adding the updates) the computational cost for obtaining the
required local approximation quality varies with the choice of $\sigma'$. From
the above worst-case upper bound, we note that the cost can increase with
$\sigma'$, as aggregation becomes more aggressive. However, as we will see in
the practical experiments in Section 7 below, the additional cost is negligible
compared to the gain in speed from the different aggregation, when measured on
real datasets.
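The dependence on $\sigma'$ can be made concrete with the smooth-loss bound (22). The sketch below evaluates the number of inner iterations $H$ for $\sigma' = 1$ and for $\sigma' = K$; the function name and all numeric values are hypothetical and only illustrate the worst-case formula, not measured cost.

```python
import math

def local_iters_smooth(n_k, sigma_p, r_max, lam, n, mu, theta):
    """Smallest H satisfying (22) for a (1/mu)-smooth loss."""
    factor = (sigma_p * r_max + lam * n * mu) / (lam * n * mu)
    return math.ceil(n_k * factor * math.log(1.0 / theta))

n, K = 8000, 8
common = dict(n_k=n // K, r_max=1.0, lam=1e-4, n=n, mu=1.0, theta=0.1)
H_averaging = local_iters_smooth(sigma_p=1.0, **common)        # sigma' = 1
H_adding = local_iters_smooth(sigma_p=float(K), **common)      # sigma' = K
```

The bound grows linearly in $\sigma'$ through the term $\sigma' r_{\max}$, so its effect is strongest when $\lambda n \mu$ is small relative to $r_{\max}$.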


6. Discussion and Related Work 

SGD-based Algorithms. For the empirical loss minimization problems of interest
here, stochastic subgradient descent (SGD) based methods are well-established.
Several distributed variants of SGD have been proposed, many of which build on
the idea of a parameter server (Niu et al., 2011; Liu et al., 2014; Duchi et
al., 2013). The downside of this approach, even when carefully implemented, is
that the amount of required communication is equal to the amount of data read
locally (e.g., mini-batch SGD with a batch size of 1 per worker). These
variants are in practice not competitive with the more communication-efficient
methods considered here, which allow more local updates per round.

One-Shot Communication Schemes. At the other extreme, there are distributed
methods using only a single round of communication, such as (Zhang et al.,
2013; Zinkevich et al., 2010; Mann et al., 2009; McWilliams et al., 2014).
These require additional assumptions on the partitioning of the data, and
furthermore cannot guarantee convergence to the optimum solution for all
regularizers, as shown in, e.g., (Shamir et al., 2014). (Balcan et al., 2012)
shows additional relevant lower bounds on the minimum number of communication
rounds necessary for a given approximation quality for similar machine learning
problems.

Mini-Batch Methods. Mini-batch methods are more flexible and lie between these
two communication vs. computation extremes. However, mini-batch versions of
both SGD and coordinate descent (CD) (Richtarik & Takac, 2013; Shalev-Shwartz &
Zhang, 2013b; Yang, 2013; Qu & Richtarik, 2014; Qu et al., 2014) suffer from
their convergence rate degrading towards the rate of batch gradient descent as
the size of the mini-batch is increased. This follows because mini-batch
updates are made based on the outdated previous parameter vector $w$, in
contrast to methods that allow immediate local updates like CoCoA. Furthermore,
the aggregation parameter for mini-batch methods is harder to tune, as it can
lie anywhere in the order of mini-batch size. In the CoCoA setting, the
parameter lies in the smaller range given by K. Our CoCoA+ extension avoids
needing to tune this parameter entirely, by adding.

Methods Allowing Local Optimization. Developing methods that allow for local
optimization requires carefully devising data-local subproblems to be solved
after each communication round. (Shamir et al., 2014; Zhang & Lin, 2015) have
proposed distributed Newton-type algorithms in this spirit. However, the
subproblems must be solved to high accuracy for convergence to hold, which is
often prohibitive as the size of the data on one machine is still relatively
large. In contrast, the CoCoA framework (Jaggi et al., 2014) allows using any
local solver of weak local approximation quality in each round. By making use
of the primal-dual structure in the line of work of (Yu et al., 2012; Pechyony
et al., 2011; Yang, 2013; Lee & Roth, 2015), the CoCoA and CoCoA+ frameworks
also allow more control over the aggregation of updates between machines. The
practical variant DisDCA-p proposed in (Yang, 2013) allows additive updates but
is restricted to SDCA updates, and was proposed without convergence guarantees.
DisDCA-p can be recovered as a special case of the CoCoA+ framework when using
SDCA as a local solver, if $n_k = n/K$ and $\sigma' := K$, see Appendix C. The
theory presented here also therefore covers that method.

Figure 1. Duality gap vs. the number of communicated vectors, as well as
duality gap vs. elapsed time in seconds for two datasets: Covertype (left, K=4)
and RCV1 (right, K=8). Both are shown on a log-log scale, and for three
different values of regularization ($\lambda$ = 1e-4; 1e-5; 1e-6). Each plot
contains a comparison of CoCoA (red) and CoCoA+ (blue), for three different
values of H, the number of local iterations performed per round. For all plots,
across all values of $\lambda$ and H, we see that CoCoA+ converges to the
optimal solution faster than CoCoA, in terms of both the number of
communications and the elapsed time.

ADMM. An alternative approach to distributed optimization is to use the
alternating direction method of multipliers (ADMM), as used for distributed SVM
training in, e.g., (Forero et al., 2010). This uses a penalty parameter
balancing between the equality constraint $w$ and the optimization objective
(Boyd et al., 2011). However, the known convergence rates for ADMM are weaker
than the more problem-tailored methods mentioned previously, and the choice of
the penalty parameter is often unclear.

Batch Proximal Methods. In spirit, for the special case of adding
($\gamma = 1$), CoCoA+ resembles a batch proximal method, using the separable
approximation (9) instead of the original dual (2). Known batch proximal
methods require high accuracy subproblem solutions, and don't allow arbitrary
solvers of weak accuracy $\Theta$ such as we do here.


7. Numerical Experiments 

We present experiments on several large real-world distributed datasets. We
show that CoCoA+ converges faster in terms of total rounds as well as elapsed
time as compared to CoCoA in all cases, despite varying: the dataset, values of
regularization, batch size, and cluster size (Section 7.2). In Section 7.3 we
demonstrate that this performance translates to orders of magnitude improvement
in convergence when scaling up the number of machines K, as compared to CoCoA
as well as to several other state-of-the-art methods. Finally, in Section 7.4
we investigate the impact of the local subproblem parameter $\sigma'$ in the
CoCoA+ framework.

Table 2. Datasets for Numerical Experiments.

Dataset          n         d   Sparsity
covertype  522,911        54     22.22%
epsilon    400,000     2,000       100%
RCV1       677,399    47,236      0.16%

7.1. Implementation Details 

We implement all algorithms in Apache Spark (Zaharia et al., 2012) and run them
on m3.large Amazon EC2 instances, applying each method to the binary hinge-loss
support vector machine. The analysis for this non-smooth loss was not covered
in (Jaggi et al., 2014) but has been captured here, and thus is both
theoretically and practically justified. The used datasets are summarized in
Table 2.

For illustration and ease of comparison, we here use SDCA (Shalev-Shwartz &
Zhang, 2013c) as the local solver for both CoCoA and CoCoA+. Note that in this
special case, and if additionally $\sigma' := K$, and if the partitioning
$n_k = n/K$ is balanced, one can show that the CoCoA+ framework reduces to the
practical variant of DisDCA (Yang, 2013) (which had no convergence guarantees
so far). We include more details on the connection in Appendix C.

7.2. Comparison of CoCoA+ and CoCoA

We compare the CoCoA+ and CoCoA frameworks directly using two datasets
(Covertype and RCV1) across various values of $\lambda$, the regularizer, in
Figure 1. For each value of $\lambda$ we consider both methods with different
values of H, the number of local iterations performed before communicating to
the master. For all runs of CoCoA+ we use the safe upper bound of $\gamma K$
for $\sigma'$. In terms of both the total number of communications made and the
elapsed time, CoCoA+ (shown in blue) converges to the optimal solution faster
than CoCoA (red). The discrepancy is larger for greater values of $\lambda$,
where the strongly convex regularizer has more of an impact and the problem
difficulty is reduced. We also see a greater performance gap for smaller values
of H, where there is frequent communication between the machines and the
master, and changes between the algorithms therefore play a larger role.

7.3. Scaling the Number of Machines K

In Figure 2 we demonstrate the ability of CoCoA+ to scale with an increasing
number of machines K. The experiments confirm the ability of strong scaling of
the new method, as predicted by our theory in Section 4, in contrast to the
competing methods. Unlike CoCoA, which becomes linearly slower when increasing
the number of machines, the performance of CoCoA+ improves with additional
machines, only starting to degrade slightly once K=16 for the RCV1 dataset.

7.4. Impact of the Subproblem Parameter $\sigma'$

Finally, in Figure 3, we consider the effect of the choice of the subproblem
parameter $\sigma'$ on convergence. We plot both the number of communications
and clock time on a log-log scale for the RCV1 dataset with K=8 and H=1e4. For
$\gamma = 1$ (the most aggressive variant of CoCoA+ in which updates are added)
we consider several different values of $\sigma'$, ranging from 1 to 8. The
value $\sigma' = 8$ represents the safe upper bound of $\gamma K$. The optimal
convergence occurs around $\sigma' = 4$, and the method diverges for
$\sigma' < 2$. Notably, we see that the easy to calculate upper bound of
$\sigma' := \gamma K$ (as given by Lemma 4) has only slightly worse performance
than the best possible subproblem parameter in our setting.

Figure 2. The effect of increasing K on the time (s) to reach an
$\epsilon_{\mathcal{D}}$-accurate solution. We see that CoCoA+ converges twice
as fast as CoCoA on 100 machines for the Epsilon dataset, and nearly 7 times as
quickly for the RCV1 dataset. Mini-batch SGD converges an order of magnitude
more slowly than both methods.




Figure 3. The effect of o' on convergence of CoCoA + for the 
RCV1 dataset distributed across K =8 machines. Decreasing o' 
improves performance in terms of communication and overall run 
time until a certain point, after which the algorithm diverges. The 
“safe” upper bound of o':=K =8 has only slightly worse perfor¬ 
mance than the practically best “un-safe” value of o'. 
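
The gap between the safe bound $\gamma K$ and a data-dependent value of $\sigma'$ can be probed numerically. The following is a minimal sketch, not the procedure used in the experiments: it samples random vectors $\alpha$ and records the largest observed ratio $\gamma \|A\alpha\|^2 / \sum_k \|A\alpha_{[k]}\|^2$, the quantity whose maximum over $\alpha$ is the separability measure in (11). The function name, the synthetic Gaussian data and the balanced partition are illustrative assumptions. Random sampling only gives a lower estimate of the maximum, so the result is not itself a safe choice of $\sigma'$.

```python
import numpy as np

def sigma_prime_lower_estimate(A, parts, gamma=1.0, trials=200, seed=0):
    """Monte Carlo lower estimate of gamma * max_alpha ||A alpha||^2 / sum_k ||A alpha_[k]||^2."""
    rng = np.random.default_rng(seed)
    best = 0.0
    for _ in range(trials):
        alpha = rng.standard_normal(A.shape[1])
        full = np.linalg.norm(A @ alpha) ** 2
        # sum over machines of the squared norm of each block's contribution
        blocks = sum(np.linalg.norm(A[:, p] @ alpha[p]) ** 2 for p in parts)
        best = max(best, full / blocks)
    return gamma * best

rng = np.random.default_rng(1)
d, n, K = 20, 80, 8
A = rng.standard_normal((d, n))            # columns are the datapoints x_i (synthetic)
parts = np.array_split(np.arange(n), K)    # balanced partition P_1, ..., P_K
estimate = sigma_prime_lower_estimate(A, parts)
safe_bound = 1.0 * K                       # gamma * K with gamma = 1
print(estimate, safe_bound)
```

Every sampled ratio is at most $K$ by the Cauchy-Schwarz inequality, which is the reason $\gamma K$ is always safe.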

8. Conclusion 

In conclusion, we present a novel framework, CoCoA+,
that allows for fast and communication-efficient additive
aggregation in distributed algorithms for primal-dual optimization.
We analyze the theoretical performance of this
method, giving strong primal-dual convergence rates with
outer iterations scaling independently of the number of machines.
We extend our theory to allow for non-smooth
losses. Our experimental results show significant speedups
over previous methods, including the original CoCoA
framework as well as other state-of-the-art methods.

Acknowledgments. We thank Ching-pei Lee and an
anonymous reviewer for several helpful insights and comments.

A. Technical Lemmas

Lemma 16 (Lemma 21 in (Shalev-Shwartz & Zhang, 2013c)). Let $\ell_i : \mathbb{R} \to \mathbb{R}$ be an $L$-Lipschitz continuous function. Then for
any real value $a$ with $|a| > L$ we have that $\ell_i^*(a) = \infty$.

Lemma 17. Assuming the loss functions $\ell_i$ are bounded by $\ell_i(0) \le 1$ for all $i \in [n]$ (as we have assumed in (5) above),
then for the zero vector $\alpha^{(0)} := 0 \in \mathbb{R}^n$, we have

\[
\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}) = \mathcal{D}(\alpha^*) - \mathcal{D}(0) \le 1. \tag{24}
\]


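Proof. For $\alpha := 0 \in \mathbb{R}^n$, we have $w(\alpha) = \frac{1}{\lambda n} A \alpha = 0 \in \mathbb{R}^d$. Therefore, by definition of the dual objective $\mathcal{D}$ given
in (2),

\[
0 \le \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha) \le \mathcal{P}(w(\alpha)) - \mathcal{D}(\alpha) = 0 - \mathcal{D}(\alpha) \overset{(5),(2)}{\le} 1.
\]

□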


B. Proofs 

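B.1. Proof of Lemma 3

Indeed, we have

\[
\mathcal{D}\Big(\alpha + \gamma \sum_{k=1}^K \Delta\alpha_{[k]}\Big)
= \underbrace{-\frac{1}{n} \sum_{i=1}^n \ell_i^*\Big(-\alpha_i - \gamma \Big(\sum_{k=1}^K \Delta\alpha_{[k]}\Big)_i\Big)}_{A}
\;-\; \frac{\lambda}{2} \underbrace{\Big\| \frac{1}{\lambda n} A \Big(\alpha + \gamma \sum_{k=1}^K \Delta\alpha_{[k]}\Big) \Big\|^2}_{B}. \tag{25}
\]

Now, let us bound the terms $A$ and $B$ separately. We have

\[
\begin{aligned}
A &= -\frac{1}{n} \sum_{k=1}^K \Big( \sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - \gamma (\Delta\alpha_{[k]})_i\big) \Big)
 = -\frac{1}{n} \sum_{k=1}^K \Big( \sum_{i \in \mathcal{P}_k} \ell_i^*\big(-(1-\gamma)\alpha_i - \gamma (\alpha + \Delta\alpha_{[k]})_i\big) \Big) \\
&\ge -\frac{1}{n} \sum_{k=1}^K \Big( \sum_{i \in \mathcal{P}_k} (1-\gamma)\, \ell_i^*(-\alpha_i) + \gamma\, \ell_i^*\big(-(\alpha + \Delta\alpha_{[k]})_i\big) \Big),
\end{aligned}
\]

where the last inequality is due to Jensen's inequality. Now we will bound $B$, using the safe separability measurement $\sigma'$
as defined in (11):

\[
\begin{aligned}
B &= \Big\| \frac{1}{\lambda n} A \Big(\alpha + \gamma \sum_{k=1}^K \Delta\alpha_{[k]}\Big) \Big\|^2
 = \Big\| w(\alpha) + \gamma \frac{1}{\lambda n} \sum_{k=1}^K A \Delta\alpha_{[k]} \Big\|^2 \\
&= \|w(\alpha)\|^2 + \sum_{k=1}^K 2\gamma \frac{1}{\lambda n} w(\alpha)^T A \Delta\alpha_{[k]} + \gamma \Big(\frac{1}{\lambda n}\Big)^2 \gamma \Big\| \sum_{k=1}^K A \Delta\alpha_{[k]} \Big\|^2 \\
&\overset{(11)}{\le} \|w(\alpha)\|^2 + \sum_{k=1}^K 2\gamma \frac{1}{\lambda n} w(\alpha)^T A \Delta\alpha_{[k]} + \gamma \Big(\frac{1}{\lambda n}\Big)^2 \sigma' \sum_{k=1}^K \big\| A \Delta\alpha_{[k]} \big\|^2 .
\end{aligned}
\]

Plugging $A$ and $B$ into (25) will give us

\[
\begin{aligned}
\mathcal{D}\Big(\alpha + \gamma \sum_{k=1}^K \Delta\alpha_{[k]}\Big)
&\ge -\frac{1}{n} \sum_{k=1}^K \Big( \sum_{i \in \mathcal{P}_k} (1-\gamma)\, \ell_i^*(-\alpha_i) + \gamma\, \ell_i^*\big(-(\alpha + \Delta\alpha_{[k]})_i\big) \Big) \\
&\quad - \gamma \frac{\lambda}{2} \|w(\alpha)\|^2 - (1-\gamma) \frac{\lambda}{2} \|w(\alpha)\|^2
 - \frac{\lambda}{2} \sum_{k=1}^K 2\gamma \frac{1}{\lambda n} w(\alpha)^T A \Delta\alpha_{[k]}
 - \frac{\lambda}{2} \gamma \Big(\frac{1}{\lambda n}\Big)^2 \sigma' \sum_{k=1}^K \big\| A \Delta\alpha_{[k]} \big\|^2 \\
&= \underbrace{-\frac{1}{n} \sum_{k=1}^K \Big( \sum_{i \in \mathcal{P}_k} (1-\gamma)\, \ell_i^*(-\alpha_i) \Big) - (1-\gamma) \frac{\lambda}{2} \|w(\alpha)\|^2}_{(1-\gamma)\mathcal{D}(\alpha)} \\
&\quad + \gamma \sum_{k=1}^K \Big( -\frac{1}{n} \sum_{i \in \mathcal{P}_k} \ell_i^*\big(-(\alpha + \Delta\alpha_{[k]})_i\big) - \frac{1}{K} \frac{\lambda}{2} \|w(\alpha)\|^2 - \frac{1}{n} w(\alpha)^T A \Delta\alpha_{[k]} - \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A \Delta\alpha_{[k]} \Big\|^2 \Big) \\
&\overset{(9)}{=} (1-\gamma)\, \mathcal{D}(\alpha) + \gamma \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]}).
\end{aligned}
\]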
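
The inequality of Lemma 3 can be checked numerically on a small instance. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses the squared loss $\ell_i(z) = \frac{1}{2}(z - y_i)^2$, whose conjugate satisfies $\ell_i^*(-a) = \frac{1}{2}a^2 - a y_i$, synthetic Gaussian data, a balanced partition, and the safe value $\sigma' = \gamma K$. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, K, lam = 10, 40, 4, 0.1
A = rng.standard_normal((d, n))          # columns are the datapoints x_i
y = rng.standard_normal(n)
parts = np.array_split(np.arange(n), K)  # partition P_1, ..., P_K

def conj(a, yi):
    # l_i^*(-a) for the squared loss l_i(z) = 0.5 * (z - y_i)^2
    return 0.5 * a**2 - a * yi

def dual(alpha):
    # dual objective D(alpha) from (2)
    w = A @ alpha / (lam * n)
    return -np.mean(conj(alpha, y)) - 0.5 * lam * (w @ w)

def local_obj(delta_k, p, w, alpha, sigma_p):
    # local subproblem objective G_k^{sigma'} from (9)
    v = A[:, p] @ delta_k / (lam * n)
    return (-np.sum(conj(alpha[p] + delta_k, y[p])) / n
            - 0.5 * lam * (w @ w) / K
            - w @ (A[:, p] @ delta_k) / n
            - 0.5 * lam * sigma_p * (v @ v))

def lemma3_slack(gamma):
    # left-hand side minus right-hand side of Lemma 3 for random alpha, delta
    alpha = rng.standard_normal(n)
    delta = rng.standard_normal(n)
    w = A @ alpha / (lam * n)
    sigma_p = gamma * K                  # safe choice from Lemma 4
    lhs = dual(alpha + gamma * delta)
    rhs = (1 - gamma) * dual(alpha) + gamma * sum(
        local_obj(delta[p], p, w, alpha, sigma_p) for p in parts)
    return lhs - rhs

slacks = [lemma3_slack(g) for g in (0.25, 0.5, 1.0)]
print(slacks)
```

A nonnegative slack for each tested $\gamma$ is what the lemma predicts; the check covers only the sampled points and is not a proof.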


B.2. Proof of Lemma 4 

See (Richtarik & Takac, 2013). 

B.3. Proof of Lemma 5 

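For sake of notation, we will write $\alpha$ instead of $\alpha^{(t)}$, $w$ instead of $w(\alpha^{(t)})$ and $u$ instead of $u^{(t)}$.

Now, let us estimate the expected change of the dual objective. Using the definition of the dual update $\alpha^{(t+1)} := \alpha^{(t)} +
\gamma \sum_k \Delta\alpha_{[k]}$ resulting in Algorithm 1, we have

\[
\mathbb{E}\big[\mathcal{D}(\alpha^{(t)}) - \mathcal{D}(\alpha^{(t+1)})\big] = \mathbb{E}\Big[\mathcal{D}(\alpha) - \mathcal{D}\Big(\alpha + \gamma \sum_{k=1}^K \Delta\alpha_{[k]}\Big)\Big]
\]

(by Lemma 3 on the local function $\mathcal{G}_k^{\sigma'}(\cdot\,; w, \alpha_{[k]})$ approximating the global objective $\mathcal{D}(\alpha)$)

\[
\begin{aligned}
&\le \mathbb{E}\Big[\mathcal{D}(\alpha) - (1-\gamma)\mathcal{D}(\alpha) - \gamma \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]})\Big] \\
&= \gamma\, \mathbb{E}\Big[\mathcal{D}(\alpha) - \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]})\Big] \\
&= \gamma\, \mathbb{E}\Big[\mathcal{D}(\alpha) - \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) + \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]})\Big]
\end{aligned}
\]

(by the notion of quality (12) of the local solver, as in Assumption 1)

\[
\begin{aligned}
&\le \gamma \Big( \mathcal{D}(\alpha) - \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) + \Theta \Big( \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \underbrace{\sum_{k=1}^K \mathcal{G}_k^{\sigma'}(0; w, \alpha_{[k]})}_{\mathcal{D}(\alpha)} \Big) \Big) \\
&= \gamma (1 - \Theta) \underbrace{\Big( \mathcal{D}(\alpha) - \sum_{k=1}^K \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) \Big)}_{C}.
\end{aligned} \tag{26}
\]

Now, let us upper bound the $C$ term (we will denote $\Delta\alpha^* = \sum_{k=1}^K \Delta\alpha^*_{[k]}$):

\[
\begin{aligned}
C &\overset{(2),(9)}{=} \frac{1}{n} \sum_{i=1}^n \big( \ell_i^*(-\alpha_i - \Delta\alpha^*_i) - \ell_i^*(-\alpha_i) \big) + \frac{1}{n} w(\alpha)^T A \Delta\alpha^* + \sum_{k=1}^K \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A \Delta\alpha^*_{[k]} \Big\|^2 \\
&\le \frac{1}{n} \sum_{i=1}^n \big( \ell_i^*(-\alpha_i - s(u_i - \alpha_i)) - \ell_i^*(-\alpha_i) \big) + \frac{1}{n} w(\alpha)^T A s(u - \alpha) + \sum_{k=1}^K \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A s(u - \alpha)_{[k]} \Big\|^2 \\
&\overset{\text{Strong conv.}}{\le} \frac{1}{n} \sum_{i=1}^n \Big( s\, \ell_i^*(-u_i) + (1-s)\, \ell_i^*(-\alpha_i) - \frac{\mu}{2} (1-s) s (u_i - \alpha_i)^2 - \ell_i^*(-\alpha_i) \Big) + \frac{1}{n} w(\alpha)^T A s(u - \alpha) \\
&\qquad + \sum_{k=1}^K \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A s(u - \alpha)_{[k]} \Big\|^2 \\
&= \frac{1}{n} \sum_{i=1}^n \Big( s\, \ell_i^*(-u_i) - s\, \ell_i^*(-\alpha_i) - \frac{\mu}{2} (1-s) s (u_i - \alpha_i)^2 \Big) + \frac{1}{n} w(\alpha)^T A s(u - \alpha) + \sum_{k=1}^K \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A s(u - \alpha)_{[k]} \Big\|^2 .
\end{aligned}
\]

The convex conjugate maximal property implies that

\[
\ell_i^*(-u_i) = -u_i\, w(\alpha)^T x_i - \ell_i(w(\alpha)^T x_i). \tag{27}
\]

Moreover, from the definition of the primal and dual optimization problems (1), (2), we can write the duality gap as

\[
G(\alpha) := \mathcal{P}(w(\alpha)) - \mathcal{D}(\alpha) \overset{(1),(2)}{=} \frac{1}{n} \sum_{i=1}^n \big( \ell_i(x_i^T w) + \ell_i^*(-\alpha_i) + w(\alpha)^T x_i \alpha_i \big). \tag{28}
\]

Hence,

\[
\begin{aligned}
C &\overset{(27)}{\le} \frac{1}{n} \sum_{i=1}^n \Big( -s u_i\, w(\alpha)^T x_i - s\, \ell_i(w(\alpha)^T x_i) - s\, \ell_i^*(-\alpha_i) \underbrace{- s\, w(\alpha)^T x_i \alpha_i + s\, w(\alpha)^T x_i \alpha_i}_{0} - \frac{\mu}{2} (1-s) s (u_i - \alpha_i)^2 \Big) \\
&\qquad + \frac{1}{n} w(\alpha)^T A s(u - \alpha) + \sum_{k=1}^K \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A s(u - \alpha)_{[k]} \Big\|^2 \\
&= \frac{1}{n} \sum_{i=1}^n \big( -s\, \ell_i(w(\alpha)^T x_i) - s\, \ell_i^*(-\alpha_i) - s\, w(\alpha)^T x_i \alpha_i \big) + \frac{1}{n} \sum_{i=1}^n \Big( s\, w(\alpha)^T x_i (\alpha_i - u_i) - \frac{\mu}{2} (1-s) s (u_i - \alpha_i)^2 \Big) \\
&\qquad + \frac{1}{n} w(\alpha)^T A s(u - \alpha) + \sum_{k=1}^K \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A s(u - \alpha)_{[k]} \Big\|^2 \\
&\overset{(28)}{=} -s\, G(\alpha) - \frac{\mu}{2} (1-s) s \frac{1}{n} \|u - \alpha\|^2 + \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 \sum_{k=1}^K \big\| A (u - \alpha)_{[k]} \big\|^2 .
\end{aligned} \tag{29}
\]

Now, the claimed improvement bound (15) follows by plugging (29) into (26).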
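
The duality gap expression (28) is easy to verify numerically. The following sketch assumes the squared loss $\ell_i(z) = \frac{1}{2}(z - y_i)^2$, for which $\ell_i^*(-a) = \frac{1}{2}a^2 - a y_i$, and uses synthetic data; the variable names are illustrative. It compares $\mathcal{P}(w(\alpha)) - \mathcal{D}(\alpha)$, computed directly from (1) and (2), with the per-example sum on the right-hand side of (28).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lam = 5, 30, 0.5
A = rng.standard_normal((d, n))     # columns are the datapoints x_i
y = rng.standard_normal(n)
alpha = rng.standard_normal(n)      # an arbitrary dual point
w = A @ alpha / (lam * n)           # primal-dual mapping w(alpha)

z = A.T @ w                         # x_i^T w for every i
loss = 0.5 * (z - y) ** 2           # l_i(x_i^T w)
conj = 0.5 * alpha**2 - alpha * y   # l_i^*(-alpha_i)

primal = loss.mean() + 0.5 * lam * (w @ w)
dual = -conj.mean() - 0.5 * lam * (w @ w)
gap_direct = primal - dual
gap_formula = np.mean(loss + conj + z * alpha)   # right-hand side of (28)
print(gap_direct, gap_formula)
```

For this particular loss each summand equals $\frac{1}{2}(x_i^T w - y_i + \alpha_i)^2$, so the gap is visibly nonnegative term by term.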

B.4. Proof of Lemma 6 

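For general convex functions, the strong convexity parameter is $\mu = 0$, and hence the definition of $R^{(t)}$ becomes

\[
R^{(t)} \overset{(16)}{=} \sum_{k=1}^K \big\| A (u^{(t)} - \alpha^{(t)})_{[k]} \big\|^2
\le \sum_{k=1}^K \sigma_k \big\| (u^{(t)} - \alpha^{(t)})_{[k]} \big\|^2
\overset{\text{Lemma 16}}{\le} \sum_{k=1}^K \sigma_k |\mathcal{P}_k|\, 4L^2 .
\]

The first inequality follows from the definition of $\sigma_k$. The second holds because $|u_i| \le L$ and $|\alpha_i| \le L$ for
$L$-Lipschitz losses, so each squared coordinate difference is at most $4L^2$.

B.5. Proof of Theorem 8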

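At first let us estimate the expected change of dual feasibility. By using the main Lemma 5, we have

\[
\begin{aligned}
\mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)})]
&= \mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)}) + \mathcal{D}(\alpha^{(t)}) - \mathcal{D}(\alpha^{(t)})] \\
&\overset{(15)}{\le} \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) - \gamma(1-\Theta) s\, G(\alpha^{(t)}) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 R^{(t)} \\
&= \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) - \gamma(1-\Theta) s \big( \mathcal{P}(w(\alpha^{(t)})) - \mathcal{D}(\alpha^{(t)}) \big) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 R^{(t)} \\
&\le \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) - \gamma(1-\Theta) s \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \big) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 R^{(t)} \\
&\overset{(18)}{\le} \big(1 - \gamma(1-\Theta) s\big) \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \big) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 4L^2 \sigma .
\end{aligned} \tag{30}
\]

Using (30) recursively we have

\[
\begin{aligned}
\mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})]
&\le \big(1 - \gamma(1-\Theta) s\big)^t \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}) \big) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 4L^2 \sigma \sum_{j=0}^{t-1} \big(1 - \gamma(1-\Theta) s\big)^j \\
&= \big(1 - \gamma(1-\Theta) s\big)^t \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}) \big) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 4L^2 \sigma\, \frac{1 - (1 - \gamma(1-\Theta) s)^t}{\gamma(1-\Theta) s} \\
&\le \big(1 - \gamma(1-\Theta) s\big)^t \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}) \big) + s\, \frac{4L^2 \sigma \sigma'}{2\lambda n^2} .
\end{aligned} \tag{31}
\]

Choice of $s = 1$ and $t = t_0 := \max\big\{0, \big\lceil \frac{1}{\gamma(1-\Theta)} \log\big( 2\lambda n^2 (\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)})) / (4L^2 \sigma \sigma') \big) \big\rceil \big\}$ will lead to

\[
\mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t_0)})] \le \big(1 - \gamma(1-\Theta)\big)^{t_0} \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(0)}) \big) + \frac{4L^2 \sigma \sigma'}{2\lambda n^2}
\le \frac{4L^2 \sigma \sigma'}{2\lambda n^2} + \frac{4L^2 \sigma \sigma'}{2\lambda n^2} = \frac{4L^2 \sigma \sigma'}{\lambda n^2} . \tag{32}
\]

Now, we are going to show that

\[
\forall t \ge t_0 : \quad \mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})] \le \frac{4L^2 \sigma \sigma'}{\lambda n^2 \big(1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0)\big)} . \tag{33}
\]

Clearly, (32) implies that (33) holds for $t = t_0$. Now imagine that it holds for any $t \ge t_0$, then we show that it also has to
hold for $t + 1$. Indeed, using

\[
s = \frac{1}{1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0)} \in [0, 1] \tag{34}
\]

we obtain

\[
\begin{aligned}
\mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)})]
&\overset{(30)}{\le} \big(1 - \gamma(1-\Theta) s\big) \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \big) + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 4L^2 \sigma \\
&\overset{(33)}{\le} \big(1 - \gamma(1-\Theta) s\big) \frac{4L^2 \sigma \sigma'}{\lambda n^2 \big(1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0)\big)} + \gamma(1-\Theta) \frac{\sigma'}{2\lambda} \Big(\frac{s}{n}\Big)^2 4L^2 \sigma \\
&\overset{(34)}{=} \frac{4L^2 \sigma \sigma'}{\lambda n^2} \cdot \frac{1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0) - \gamma(1-\Theta) + \frac{1}{2} \gamma(1-\Theta)}{\big(1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0)\big)^2} \\
&= \frac{4L^2 \sigma \sigma'}{\lambda n^2} \underbrace{\frac{1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0) - \frac{1}{2} \gamma(1-\Theta)}{\big(1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0)\big)^2}}_{D} .
\end{aligned}
\]

Now, we will upper bound $D$ as follows:

\[
D = \frac{1}{1 + \frac{1}{2} \gamma(1-\Theta)(t + 1 - t_0)} \underbrace{\frac{\big(1 + \frac{1}{2} \gamma(1-\Theta)(t + 1 - t_0)\big)\big(1 + \frac{1}{2} \gamma(1-\Theta)(t - 1 - t_0)\big)}{\big(1 + \frac{1}{2} \gamma(1-\Theta)(t - t_0)\big)^2}}_{\le 1}
\le \frac{1}{1 + \frac{1}{2} \gamma(1-\Theta)(t + 1 - t_0)} ,
\]

where in the last inequality we have used the fact that the geometric mean is less than or equal to the arithmetic mean.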
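If $\bar{\alpha}$ is defined as in (21) then we obtain that

\[
\begin{aligned}
\mathbb{E}[G(\bar{\alpha})] &= \mathbb{E}\Big[ G\Big( \sum_{t=T_0}^{T-1} \frac{1}{T - T_0} \alpha^{(t)} \Big) \Big] \le \frac{1}{T - T_0} \mathbb{E}\Big[ \sum_{t=T_0}^{T-1} G(\alpha^{(t)}) \Big] \\
&\overset{(15),(18)}{\le} \frac{1}{T - T_0} \mathbb{E}\Big[ \sum_{t=T_0}^{T-1} \Big( \frac{1}{\gamma(1-\Theta) s} \big( \mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)}) \big) + \frac{4L^2 \sigma \sigma' s}{2\lambda n^2} \Big) \Big] \\
&= \frac{1}{\gamma(1-\Theta) s} \frac{1}{T - T_0} \mathbb{E}\big[ \mathcal{D}(\alpha^{(T)}) - \mathcal{D}(\alpha^{(T_0)}) \big] + \frac{4L^2 \sigma \sigma' s}{2\lambda n^2} \\
&\le \frac{1}{\gamma(1-\Theta) s} \frac{1}{T - T_0} \mathbb{E}\big[ \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(T_0)}) \big] + \frac{4L^2 \sigma \sigma' s}{2\lambda n^2} .
\end{aligned} \tag{35}
\]

Now, if $T \ge \big\lceil \frac{1}{\gamma(1-\Theta)} \big\rceil + T_0$ such that $T_0 \ge t_0$ we obtain

\[
\begin{aligned}
\mathbb{E}[G(\bar{\alpha})] &\overset{(35),(33)}{\le} \frac{1}{\gamma(1-\Theta) s} \frac{1}{T - T_0} \Big( \frac{4L^2 \sigma \sigma'}{\lambda n^2 \big(1 + \frac{1}{2} \gamma(1-\Theta)(T_0 - t_0)\big)} \Big) + \frac{4L^2 \sigma \sigma' s}{2\lambda n^2} \\
&= \frac{4L^2 \sigma \sigma'}{\lambda n^2} \Big( \frac{1}{\gamma(1-\Theta) s} \frac{1}{T - T_0} \frac{1}{1 + \frac{1}{2} \gamma(1-\Theta)(T_0 - t_0)} + \frac{s}{2} \Big) .
\end{aligned} \tag{36}
\]

Choosing

\[
s = \frac{1}{(T - T_0)\, \gamma(1-\Theta)} \in [0, 1] \tag{37}
\]

gives us

\[
\mathbb{E}[G(\bar{\alpha})] \overset{(36),(37)}{\le} \frac{4L^2 \sigma \sigma'}{\lambda n^2} \Big( \frac{1}{1 + \frac{1}{2} \gamma(1-\Theta)(T_0 - t_0)} + \frac{1}{(T - T_0)\, \gamma(1-\Theta)} \cdot \frac{1}{2} \Big) . \tag{38}
\]

To have the right hand side of (38) smaller than $\epsilon_G$ it is sufficient to choose $T_0$ and $T$ such that

\[
\frac{4L^2 \sigma \sigma'}{\lambda n^2} \Big( \frac{1}{1 + \frac{1}{2} \gamma(1-\Theta)(T_0 - t_0)} \Big) \le \frac{1}{2} \epsilon_G , \tag{39}
\]

\[
\frac{4L^2 \sigma \sigma'}{\lambda n^2} \Big( \frac{1}{(T - T_0)\, \gamma(1-\Theta)\, 2} \Big) \le \frac{1}{2} \epsilon_G . \tag{40}
\]

Hence if

\[
t_0 + \frac{2}{\gamma(1-\Theta)} \Big( \frac{8L^2 \sigma \sigma'}{\lambda n^2 \epsilon_G} - 1 \Big) \le T_0 ,
\qquad
T_0 + \frac{4L^2 \sigma \sigma'}{\lambda n^2 \epsilon_G\, \gamma(1-\Theta)} \le T ,
\]

then (39) and (40) are satisfied.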
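
The iteration counts derived above can be evaluated for concrete constants. The helper below is a hedged sketch: the function name, the argument names and the example values are assumptions made for illustration, and the initial dual suboptimality is replaced by the bound 1 from Lemma 17. It computes $t_0$ as in the choice preceding (32), and then the smallest integers $T_0$ and $T$ satisfying the two conditions that imply (39) and (40).

```python
import math

def lipschitz_iteration_bounds(L, sigma, sigma_p, lam, n, gamma, theta, eps_G,
                               init_gap=1.0):
    """Return (t0, T0, T) sufficient for an expected duality gap of eps_G.

    init_gap stands in for D(alpha*) - D(alpha^(0)); Lemma 17 bounds it by 1.
    """
    c = 4.0 * L**2 * sigma * sigma_p / (lam * n**2)   # recurring constant
    rate = gamma * (1.0 - theta)
    t0 = max(0, math.ceil(math.log(2.0 * init_gap / c) / rate))
    # condition implying (39); clipped at zero when eps_G is already loose
    T0 = t0 + math.ceil(max(0.0, (2.0 / rate) * (2.0 * c / eps_G - 1.0)))
    # condition implying (40), together with T >= ceil(1 / rate) + T0
    T = T0 + max(math.ceil(c / (eps_G * rate)), math.ceil(1.0 / rate))
    return t0, T0, T

# Illustrative values only
params = dict(L=1.0, sigma=1000.0, sigma_p=8.0, lam=1e-3, n=10000,
              gamma=1.0, theta=0.5, eps_G=1e-2)
t0, T0, T = lipschitz_iteration_bounds(**params)
print(t0, T0, T)
```

Both counts scale with $\sigma\sigma' / (\lambda n^2 \epsilon_G)$, which is where the choice of $\sigma'$ enters the rate for Lipschitz losses.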


B.6. Proof of Theorem 10 

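If the function $\ell_i(\cdot)$ is $(1/\mu)$-smooth then $\ell_i^*(\cdot)$ is $\mu$-strongly convex with respect to the $\|\cdot\|$ norm. From (16) we have

\[
\begin{aligned}
R^{(t)} &\overset{(16)}{=} -\frac{\lambda \mu n (1-s)}{\sigma' s} \big\| u^{(t)} - \alpha^{(t)} \big\|^2 + \sum_{k=1}^K \big\| A (u^{(t)} - \alpha^{(t)})_{[k]} \big\|^2 \\
&\le -\frac{\lambda \mu n (1-s)}{\sigma' s} \big\| u^{(t)} - \alpha^{(t)} \big\|^2 + \sum_{k=1}^K \sigma_k \big\| (u^{(t)} - \alpha^{(t)})_{[k]} \big\|^2 \\
&\le -\frac{\lambda \mu n (1-s)}{\sigma' s} \big\| u^{(t)} - \alpha^{(t)} \big\|^2 + \sigma_{\max} \sum_{k=1}^K \big\| (u^{(t)} - \alpha^{(t)})_{[k]} \big\|^2 \\
&= \Big( -\frac{\lambda \mu n (1-s)}{\sigma' s} + \sigma_{\max} \Big) \big\| u^{(t)} - \alpha^{(t)} \big\|^2 .
\end{aligned} \tag{41}
\]

If we plug

\[
s = \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \in [0, 1] \tag{42}
\]

into (41) we obtain that $\forall t : R^{(t)} \le 0$. Putting the same $s$ into (15) will give us

\[
\mathbb{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})] \overset{(15),(42)}{\ge} \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} G(\alpha^{(t)}) \ge \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \big) . \tag{43}
\]

Using the fact that $\mathbb{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})] = \mathbb{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^*)] + \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})$ we have

\[
\mathbb{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^*)] + \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \overset{(43)}{\ge} \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \big)
\]

which is equivalent to

\[
\mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t+1)})] \le \Big( 1 - \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \Big) \big( \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)}) \big) . \tag{44}
\]

Therefore if we denote $\epsilon_D^{(t)} = \mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})$ we have that

\[
\mathbb{E}[\epsilon_D^{(t)}] \overset{(44)}{\le} \Big( 1 - \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \Big)^t \epsilon_D^{(0)}
\overset{(24)}{\le} \Big( 1 - \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \Big)^t
\le \exp\Big( -t\, \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \Big) .
\]

The right hand side will be smaller than some $\epsilon_D$ if

\[
t \ge \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \log \frac{1}{\epsilon_D} .
\]

Moreover, to bound the duality gap, we have

\[
\gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} G(\alpha^{(t)}) \overset{(43)}{\le} \mathbb{E}[\mathcal{D}(\alpha^{(t+1)}) - \mathcal{D}(\alpha^{(t)})] \le \mathbb{E}[\mathcal{D}(\alpha^*) - \mathcal{D}(\alpha^{(t)})] .
\]

Therefore $G(\alpha^{(t)}) \le \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \epsilon_D^{(t)}$. Hence if $\epsilon_D \le \gamma(1-\Theta) \frac{\lambda \mu n}{\lambda \mu n + \sigma_{\max} \sigma'} \epsilon_G$ then $G(\alpha^{(t)}) \le \epsilon_G$. Therefore after

\[
t \ge \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \log\Big( \frac{1}{\gamma(1-\Theta)} \frac{\lambda \mu n + \sigma_{\max} \sigma'}{\lambda \mu n} \frac{1}{\epsilon_G} \Big)
\]

iterations we have obtained a duality gap less than $\epsilon_G$.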
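
The two iteration counts of this proof translate directly into a small calculator. The sketch below uses hypothetical function names and illustrative constants; it evaluates the bounds as written above and does not account for any data-dependent improvement of $\sigma'$ or $\sigma_{\max}$.

```python
import math

def condition_factor(lam, mu, n, sigma_max, sigma_p):
    # (lam*mu*n + sigma_max*sigma') / (lam*mu*n), i.e. 1/s for s chosen as in (42)
    return (lam * mu * n + sigma_max * sigma_p) / (lam * mu * n)

def dual_iterations(lam, mu, n, sigma_max, sigma_p, gamma, theta, eps_D):
    # outer iterations sufficient for expected dual suboptimality eps_D
    kappa = condition_factor(lam, mu, n, sigma_max, sigma_p)
    return math.ceil(kappa / (gamma * (1.0 - theta)) * math.log(1.0 / eps_D))

def gap_iterations(lam, mu, n, sigma_max, sigma_p, gamma, theta, eps_G):
    # outer iterations sufficient for duality gap eps_G
    kappa = condition_factor(lam, mu, n, sigma_max, sigma_p)
    factor = kappa / (gamma * (1.0 - theta))
    return math.ceil(factor * math.log(factor / eps_G))

# Illustrative values only
cfg = dict(lam=1e-3, mu=1.0, n=10000, sigma_max=50.0, sigma_p=8.0,
           gamma=1.0, theta=0.5)
t_dual = dual_iterations(eps_D=1e-4, **cfg)
t_gap = gap_iterations(eps_G=1e-4, **cfg)
print(t_dual, t_gap)
```

The gap bound is larger than the dual bound for the same target accuracy, since it pays the additional factor inside the logarithm.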


B.7. Proof of Theorem 13 

Because $\ell_i$ are $(1/\mu)$-smooth, the functions $\ell_i^*$ are $\mu$-strongly convex with respect to the norm $\|\cdot\|$. The proof is based on
techniques developed in recent coordinate descent papers, including (Richtarik & Takac, 2014; 2013; Richtarik & Takac,
2015; Tappenden et al., 2015; Marecek et al., 2014; Fercoq & Richtarik, 2013; Lu & Xiao, 2013; Fercoq et al., 2014;
Qu & Richtarik, 2014; Qu et al., 2014) (Efficient accelerated variants were considered in (Fercoq & Richtarik, 2013;
Shalev-Shwartz & Zhang, 2013a)).

First, let us define the function $F(\zeta) : \mathbb{R}^{n_k} \to \mathbb{R}$ as $F(\zeta) := -\mathcal{G}_k^{\sigma'}\big( \sum_{i \in \mathcal{P}_k} \zeta_i e_i; w, \alpha_{[k]} \big)$. This function can be written in
two parts $F(\zeta) = \Phi(\zeta) + f(\zeta)$. The first part, denoted by $\Phi(\zeta) = \frac{1}{n} \sum_{i \in \mathcal{P}_k} \ell_i^*(-\alpha_i - \zeta_i)$, is strongly convex with convexity
parameter $\frac{\mu}{n}$ with respect to the standard Euclidean norm. In our application, we think of the $\zeta$ variable collecting the local
dual variables $\Delta\alpha_{[k]}$.

The second part we will denote by $f(\zeta) = \frac{1}{K} \frac{\lambda}{2} \|w(\alpha)\|^2 + \frac{1}{n} \sum_{i \in \mathcal{P}_k} w(\alpha)^T x_i \zeta_i + \frac{\lambda}{2} \sigma' \frac{1}{\lambda^2 n^2} \big\| \sum_{i \in \mathcal{P}_k} x_i \zeta_i \big\|^2$. It is easy
to show that the gradient of $f$ is coordinate-wise Lipschitz continuous with Lipschitz constant $\frac{\sigma'}{\lambda n^2} r_{\max}$ with respect to the
standard Euclidean norm.

Following the proof of Theorem 20 in (Richtarik & Takac, 2015), we obtain that

\[
\mathbb{E}\big[ \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha^{(h+1)}_{[k]}; w, \alpha_{[k]}) \big]
\le \Big( 1 - \frac{1}{n_k} \frac{\lambda n \mu}{\sigma' r_{\max} + \lambda n \mu} \Big) \Big( \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha^{(h)}_{[k]}; w, \alpha_{[k]}) \Big) .
\]

Over all steps up to step $h$, this gives

\[
\mathbb{E}\big[ \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha^{(h)}_{[k]}; w, \alpha_{[k]}) \big]
\le \Big( 1 - \frac{1}{n_k} \frac{\lambda n \mu}{\sigma' r_{\max} + \lambda n \mu} \Big)^h \Big( \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(0; w, \alpha_{[k]}) \Big) .
\]

Therefore, choosing $H$ as in the assumption of our Theorem, given in Equation (22), we are guaranteed that
$\big( 1 - \frac{1}{n_k} \frac{\lambda n \mu}{\sigma' r_{\max} + \lambda n \mu} \big)^H \le \Theta$, as desired.
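
To see how many local steps the contraction above requires, one can solve $(1 - \rho)^H \le \Theta$ for $H$, where $\rho = \frac{1}{n_k} \frac{\lambda n \mu}{\sigma' r_{\max} + \lambda n \mu}$. The sketch below is illustrative: the function name and the example constants are assumptions. It returns both the smallest such $H$ and the simpler sufficient value $\lceil \log(1/\Theta) / \rho \rceil$, which follows from $1 - \rho \le e^{-\rho}$; it is not claimed to reproduce the exact constant of Equation (22).

```python
import math

def local_steps_for_theta(theta, n_k, n, lam, mu, sigma_p, r_max):
    """Local coordinate steps H so that (1 - rho)^H <= theta, with 0 < theta < 1."""
    rho = (lam * n * mu / (sigma_p * r_max + lam * n * mu)) / n_k
    exact = math.ceil(math.log(theta) / math.log(1.0 - rho))   # smallest H
    sufficient = math.ceil(math.log(1.0 / theta) / rho)        # via 1 - x <= exp(-x)
    return exact, sufficient, rho

# Illustrative values only; r_max plays the role of max_i ||x_i||^2
exact_H, sufficient_H, rho = local_steps_for_theta(
    theta=0.1, n_k=1250, n=10000, lam=1e-3, mu=1.0, sigma_p=8.0, r_max=1.0)
print(exact_H, sufficient_H)
```

A smaller $\sigma'$ increases $\rho$ and therefore reduces the number of local steps needed for the same accuracy $\Theta$.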


B.8. Proof of Theorem 14 


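Similarly as in the proof of Theorem 13 we define a composite function $F(\zeta) = f(\zeta) + \Phi(\zeta)$. However, in this case the functions
$\ell_i^*$ are not guaranteed to be strongly convex. Nevertheless, the first part still has a coordinate-wise Lipschitz continuous
gradient with constant $\frac{\sigma'}{\lambda n^2} r_{\max}$ with respect to the standard Euclidean norm. Therefore from Theorem 3 in (Tappenden
et al., 2015) we have that

\[
\mathbb{E}\big[ \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(\Delta\alpha^{(h)}_{[k]}; w, \alpha_{[k]}) \big]
\le \frac{n_k}{n_k + h} \Big( \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) - \mathcal{G}_k^{\sigma'}(0; w, \alpha_{[k]}) + \frac{1}{2} \frac{\sigma' r_{\max}}{\lambda n^2} \big\| \Delta\alpha^*_{[k]} \big\|^2 \Big) . \tag{45}
\]

Now, the choice of $h = H$ from (23) is sufficient to have the right hand side of (45) be $\le \Theta \big( \mathcal{G}_k^{\sigma'}(\Delta\alpha^*_{[k]}; w, \alpha_{[k]}) -
\mathcal{G}_k^{\sigma'}(0; w, \alpha_{[k]}) \big)$.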


C. Relationship of DisDCA to CoCoA+

We are indebted to Ching-pei Lee for showing the following relationship between the practical variant of DisDCA (Yang,
2013), and CoCoA+ when SDCA is chosen as the local solver:

Considering the practical variant of DisDCA (DisDCA-p, see Figure 2 in (Yang, 2013)) using the scaling parameter scl =
$K$, the following holds:

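Lemma 18. Assume that the dataset is partitioned equally between workers, i.e. $\forall k : n_k = n/K$. If within the CoCoA+
framework, SDCA is used as a local solver, and the subproblems are formulated using our shown "safe" (but pessimistic)
upper bound of $\sigma' = K$, with aggregation parameter $\gamma = 1$ (adding), then the CoCoA+ framework reduces exactly to the
DisDCA-p algorithm.


Proof (Due to Ching-pei Lee, with some reformulations). As defined in (9), the data-local subproblem solved by each
machine in CoCoA+ is defined as

\[
\max_{\Delta\alpha_{[k]}} \; \mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]})
\]

where

\[
\mathcal{G}_k^{\sigma'}(\Delta\alpha_{[k]}; w, \alpha_{[k]}) := -\frac{1}{n} \sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) - \frac{1}{K} \frac{\lambda}{2} \|w\|^2 - \frac{1}{n} w^T A \Delta\alpha_{[k]} - \frac{\lambda}{2} \sigma' \Big\| \frac{1}{\lambda n} A \Delta\alpha_{[k]} \Big\|^2 .
\]

We rewrite the local problem by scaling with $n$, and removing the constant regularizer term $\frac{1}{K} \frac{\lambda}{2} \|w\|^2$, i.e.

\[
\tilde{\mathcal{G}}_k^{\sigma'}(\Delta\alpha_{[k]}; w) := -\sum_{i \in \mathcal{P}_k} \ell_i^*\big(-\alpha_i - (\Delta\alpha_{[k]})_i\big) - w^T A \Delta\alpha_{[k]} - \frac{\sigma'}{2\lambda n} \big\| A \Delta\alpha_{[k]} \big\|^2 . \tag{46}
\]

For the correspondence of interest, we now restrict to single coordinate updates in the local solver. In other words, the
local solver optimizes exactly one coordinate $i \in \mathcal{P}_k$ at a time. To relate the single coordinate update to the set of local
variables, we will use the notation

\[
\Delta\alpha_{[k]} =: \Delta\alpha_{[k]}^{\text{prev}} + \delta e_i , \tag{47}
\]

so that $\Delta\alpha_{[k]}^{\text{prev}}$ are the previous local variables, and $\Delta\alpha_{[k]}$ will be the updated ones.

From now on, we will consider the special case of CoCoA+ when the quadratic upper bound parameter is chosen as the
"safe" value $\sigma' = K$, combined with adding as the aggregation, i.e. $\gamma = 1$.

Now if the local solver within CoCoA+ is chosen as LocalSDCA, then one local step on the subproblem (9) will
calculate the following coordinate update. Recall that $A = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d \times n}$.

\[
\delta^* := \operatorname*{argmax}_{\delta \in \mathbb{R}} \; \tilde{\mathcal{G}}_k^{\sigma'}(\Delta\alpha_{[k]}; w) \tag{48}
\]

which - because it is only affecting one single coordinate, employing (47) - can be expressed as

\[
\begin{aligned}
\delta^* &:= \operatorname*{argmax}_{\delta \in \mathbb{R}} \; -\ell_i^*\big(-(\alpha_i + (\Delta\alpha_{[k]}^{\text{prev}})_i + \delta)\big) - \delta x_i^T w - \frac{K}{\lambda n} \delta x_i^T A \Delta\alpha_{[k]}^{\text{prev}} - \frac{K}{2\lambda n} \delta^2 \|x_i\|_2^2 \\
&= \operatorname*{argmax}_{\delta \in \mathbb{R}} \; -\ell_i^*\big(-(\alpha_i + (\Delta\alpha_{[k]}^{\text{prev}})_i + \delta)\big) - \delta x_i^T \underbrace{\Big( w + \frac{K}{\lambda n} A \Delta\alpha_{[k]}^{\text{prev}} \Big)}_{=:\, u_{\text{local}}} - \frac{K}{2\lambda n} \delta^2 \|x_i\|_2^2 .
\end{aligned} \tag{49}
\]

From this formulation, it is apparent that single coordinate local solvers should maintain their locally updated version of
the current primal parameters, which we here denote as

\[
u_{\text{local}} = w + \frac{K}{\lambda n} A \Delta\alpha_{[k]}^{\text{prev}} . \tag{50}
\]

In the practical variant of DisDCA, the summarized local primal updates are $\Delta u_{\text{local}} = \frac{1}{\lambda n_k} A \Delta\alpha_{[k]}$. For the balanced case
$n_k = n/K$ for $K$ being the number of machines, this means the local $u_{\text{local}}$ update of DisDCA-p is

\[
\Delta\alpha_i^* := \operatorname*{argmax}_{\Delta\alpha_i \in \mathbb{R}} \; -\ell_i^*\big(-(\alpha_i + \Delta\alpha_i)\big) - \Delta\alpha_i\, x_i^T u_{\text{local}} - \frac{K}{2\lambda n} (\Delta\alpha_i)^2 \|x_i\|_2^2 . \tag{51}
\]

It is not hard to show that during one outer round, the evolution of the local dual variables $\Delta\alpha_{[k]}$ is the same in both
methods, such that they will also have the same trajectory of $u_{\text{local}}$. This requires some care if the same coordinate is
sampled more than once in a round, which can happen in LocalSDCA within CoCoA+ and also in DisDCA-p. □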
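
The coordinate update (49) and the local primal copy (50) can be made concrete for a specific loss. The following is a minimal single-process sketch, assuming the squared loss $\ell_i(z) = \frac{1}{2}(z - y_i)^2$, for which the maximizer of (49) has the closed form $\delta^* = (y_i - x_i^T u_{\text{local}} - \alpha_i - (\Delta\alpha_{[k]}^{\text{prev}})_i) / (1 + \frac{K}{\lambda n} \|x_i\|^2)$. The function name, the synthetic data and the sequential loop standing in for $K$ parallel machines are illustrative assumptions and not the implementation used in the experiments.

```python
import numpy as np

def cocoa_plus_ridge(A, y, lam, K, rounds=30, H=200, seed=0):
    """CoCoA+ with gamma = 1 and sigma' = K, LocalSDCA-style steps, squared loss."""
    rng = np.random.default_rng(seed)
    d, n = A.shape
    parts = np.array_split(np.arange(n), K)   # balanced partition
    alpha, w = np.zeros(n), np.zeros(d)
    sq_norms = np.sum(A**2, axis=0)           # ||x_i||^2
    scale = K / (lam * n)                     # sigma' / (lambda n) with sigma' = K
    gaps = []
    for _ in range(rounds):
        delta = np.zeros(n)
        for p in parts:                       # simulated machines (parallel in practice)
            u_local = w.copy()                # local primal copy, cf. (50)
            for i in rng.choice(p, size=H):   # H random coordinate steps
                c = alpha[i] + delta[i]
                # closed-form maximizer of (49) for the squared loss
                step = (y[i] - A[:, i] @ u_local - c) / (1.0 + scale * sq_norms[i])
                delta[i] += step
                u_local += scale * step * A[:, i]
        alpha += delta                        # gamma = 1: add all local updates
        w = A @ alpha / (lam * n)
        z = A.T @ w
        # duality gap (28); for the squared loss each summand is 0.5*(z - y + alpha)^2
        gaps.append(np.mean(0.5 * (z - y + alpha) ** 2))
    return w, alpha, gaps

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 80))             # synthetic data, columns x_i
y = rng.standard_normal(80)
w, alpha, gaps = cocoa_plus_ridge(A, y, lam=1.0, K=4)
print(gaps[0], gaps[-1])
```

Resampling a coordinate is handled by always reading the current `delta[i]` and `u_local`, which is the care the proof above alludes to.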


Discussion. In view of the above lemma, we summarize the connection between the two methods as follows:


• CoCoA/+ is Not an Algorithm. Rather, it is a framework which allows the use of any local solver to perform
  approximate steps on the local subproblem. This additional level of abstraction (from the definition of such local
  subproblems in (9)) is the first to allow reusability of any fast/tuned and problem-specific single machine solvers,
  while decoupling this from the distributed algorithmic scheme, as presented in Algorithm 1.

  Concerning the choice of local solver to be used within CoCoA/+, SDCA is not the fastest known single machine
  solver for most applications. Much recent research has shown improvements on SDCA (Shalev-Shwartz & Zhang,
  2013c), such as accelerated variants (Shalev-Shwartz & Zhang, 2013b) and other approaches including variance
  reduction, methods incorporating second-order information, and importance sampling. In this light, we encourage the
  user of the CoCoA or CoCoA+ framework to plug in the best and most recent solver available for their particular
  local problem (within Algorithm 1), which is not necessarily SDCA. This choice should be made explicit especially
  when comparing algorithms. Our presented convergence theory from Section 4 will still cover these choices, since it
  only depends on the relative accuracy $\Theta$ of the chosen local solver.

• CoCoA+ is Theoretically Safe, while still Adaptive to the Data. The general definition of the local subproblems,
  and therefore the treatment of the varying separable bound on the objective - quantified by $\sigma'$ - allows our framework
  to adapt to the difficulty of the data partition and still give convergence results. The data-dependent measure $\sigma'$ is
  fully decoupled from what the user of the framework prefers to employ as a local solver (see also the comment below
  that CoCoA is not a coordinate solver).

  The safe upper bound $\sigma' = K$ is worst-case pessimistic, so that the convergence theory still holds in all cases when the
  updates are added. Using additional knowledge from the input data, better bounds and therefore better step-sizes can
  be achieved in CoCoA+. An example where $\sigma'$ can be safely chosen much smaller is when the data matrix satisfies
  strong row/column sparsity, see e.g. Lemma 1 in (Richtarik & Takac, 2013).

• Obtaining DisDCA-p as a Special Case. As shown in Lemma 18 above, if in CoCoA+ SDCA is
  used as the local solver, the pessimistic upper bound of $\sigma' = K$ is used and, moreover, the dataset is partitioned
  equally, i.e. $\forall k : n_k = n/K$, then the CoCoA+ framework reduces exactly to the DisDCA-p algorithm by (Yang, 2013).

  The correspondence breaks down if the subproblem parameter is chosen to be a practically good value $\sigma' \ne K$. Also, as
  noted above, SDCA is often not the best local solver currently available. In our above experiments, SDCA was used
  just for demonstration purposes and ease of comparison. Furthermore, the data partition might often be unbalanced
  in practical applications.

  While both DisDCA-p and CoCoA are special cases of CoCoA+, we note that DisDCA-p cannot be recovered as a
  special case of the original CoCoA framework (Jaggi et al., 2014).

• CoCoA/+ are Not Coordinate Methods. Despite the original name being motivated from this special case, CoCoA
  and CoCoA+ are not coordinate methods. In fact, CoCoA+ as presented here for the adding case ($\gamma = 1$) is much
  more closely related to a batch method applied to the dual, using a block-separable proximal term, as follows
  from our new subproblem formulation (9), depending on $\sigma'$. See also the remark in Section 6. The framework here
  (Algorithm 1) gives more generality, as the local solver used is not restricted to be a coordinate-wise one. In fact the
  framework allows recent and future improvements of single machine solvers to be translated directly to the distributed
  setting, by employing them within Algorithm 1. DisDCA-p works very well for several applications, but is restricted
  to using local coordinate ascent (SDCA) steps.

• Theoretical Convergence Results. While DisDCA-p (Yang, 2013) was proposed without theoretical justification
  (hence the nomenclature), the main contribution in the paper here - apart from the arbitrary local solvers - is the
  convergence analysis for the framework. The theory proposed in (Yang et al., 2013) is given only for the setting of
  orthogonal partitions, i.e., when $\sigma' = 1$ and the problems become trivial to distribute given the orthogonality of data
  between the workers.

  The theoretical analysis here gives convergence rates applying to Algorithm 1 when using arbitrary local solvers,
  and inherits the performance of the local solver. As a special case, we obtain the first theoretical justification and
  convergence rates for the original CoCoA in the case of general convex objectives, as well as for the special case of
  DisDCA-p for both general convex and smooth convex objectives.



